flat assembler
Documentation and tutorials.

flat assembler g
Introduction and Overview

What is flat assembler g?

It is an assembly engine designed as a successor to the one used in flat assembler 1, one of the recognized assemblers for x86 processors. This is a bare engine that by itself has no ability to recognize and encode instructions of any processor; however, it has the ability to become an assembler for any CPU architecture. It has a macroinstruction language that is substantially improved compared to the one provided by flat assembler 1, and it makes it easy to implement instruction encoders in the form of customizable macroinstructions.

The source code of this tool can be compiled with flat assembler 1, but it is also possible to use flat assembler g itself to compile it. The source contains clauses that include different header files depending on the assembler used. When flat assembler g compiles itself, it uses the provided set of headers that implement x86 instructions and formats with a syntax mostly compatible with flat assembler 1.

The example programs for the x86 architecture that come in this package are selected samples that originally came with flat assembler 1. They use sets of headers that implement the instruction encoders and output formatters required to assemble them just like the original flat assembler did.

To demonstrate how the instruction sets of different architectures may be implemented, there are some example programs for two microcontrollers, 8051 and AVR. They have been kept simple and therefore do not provide a complete framework for programming such CPUs, though they may provide a solid base for the creation of such environments.

There is also an example of assembling JVM bytecode, which is a conversion of the sample originally created for flat assembler 1. For this reason it is somewhat crude and does not fully utilize the capabilities offered by the new engine. However, it is good at visualising the structure of a class file.

How does this work?

The essential function of flat assembler g is to generate output defined by the instructions in the source code. Given the single line of text shown below, the assembler would generate a single byte with the stated value:

        db 90h

Macroinstructions can be defined to generate specific sequences of data depending on the provided parameters. They may correspond to the instructions of a chosen machine language, as in the following example, but they could just as well be defined to generate other kinds of data, for various purposes.

        macro int number
                if number = 3
                        db 0CCh
                else
                        db 0CDh, number
                end if
        end macro

        int 20h         ; generates two bytes        
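
As a quick check of the conditional branch, and assuming the "int" macroinstruction defined above is in effect, the special case can be exercised next to an ordinary one:

        int 3           ; number = 3, so only the single byte 0CCh is generated
        int 21h         ; any other number gives 0CDh followed by that number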

Assembly seen this way may be considered a kind of interpreted language, and the assembler certainly has many characteristics of an interpreter. However, it also shares certain aspects with a compiler. It is possible for an instruction to use a value which is defined later in the source and which may depend on the instructions that come before that definition, as demonstrated by the following sample.

        macro jmpi target
                if target-($+2) < 80h & target-($+2) >= -80h                    
                        db 0EBh
                        db target-($+1)
                else
                        db 0E9h
                        dw target-($+2)
                end if 
        end macro

                jmpi start  
                db 'some data'  
        start:

The "jmpi" defined above produces the code of a jump instruction as in the 8086 architecture. Such code contains the relative offset of the target of a jump, stored in either a single byte or a 16-bit word. The relative offset is computed as the difference between the address of the target and the address of the next instruction. The special symbol "$" provides the current address, which at the start of the macroinstruction is the address of the instruction itself and which advances as bytes are generated. It is used to calculate the relative offset and to determine whether it may fit in a single byte.

Therefore the code generated by "jmpi start" in the above sample depends on the value of the address labeled as "start", and this in turn depends on the length of the output of all the instructions that precede it, including the said jump. This creates a loop of dependencies, and the assembler needs to find a solution that fulfills all the constraints created by the source text. This would not be possible if the assembler were just an imperative interpreter. Its language is thus in some aspects declarative.
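
The other outcome can be sketched by moving the target further away. This assumes the "jmpi" macroinstruction from the previous sample; the label name and the amount of reserved space are arbitrary choices made for illustration:

                jmpi distant    ; the offset does not fit in a byte: 0E9h and a 16-bit offset
                rb 200          ; 200 reserved bytes between the jump and its target
        distant:

Here a two-byte jump would require an offset of 200, which is outside the range accepted by the condition, so the three-byte form is the only one consistent with all the constraints.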

Finding a solution for such circular dependencies may resemble solving an equation, and it is even possible to construct an example where flat assembler g is indeed capable of solving one:

        x = (x-1)*(x+2)/2-2*(x+1)
        db x

The circular reference has been reduced here to a single definition that references itself to construct the value. flat assembler g is able to find a solution in this case, though in many others it may fail. The method used by this assembler is to perform multiple passes over the source text and then try to predict all the values with the knowledge gathered this way. This approach is in most cases good enough for the assembly of machine code, but it rarely suffices to solve complex equations, and the above sample is one of the exceptions.
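
Rearranged, the definition above is the quadratic x*x - 5*x - 6 = 0, whose integer solutions are 6 and -1, and both of them also satisfy the original form with integer division. A hedged way to check the result, without assuming which of the two the assembler settles on, might be to add an "assert" after the definition:

        x = (x-1)*(x+2)/2-2*(x+1)
        assert x = 6 | x = -1   ; either root satisfies the definition
        db x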

What are the means of parsing the arguments of an instruction?

Not all instructions have a simple syntax like the ones in the previous examples. To aid in the processing of arguments that may contain special constructions, flat assembler g provides a few capable tools, demonstrated below with examples that implement a select few instructions of the Z80 processor. The rules governing the use of the presented features are found in the manual.

When an instruction has a very small set of allowed arguments, each one of them can be treated separately with the "match" construction:

        macro EX? first,second
                match (=SP?), first
                        match =HL?, second
                                db 0E3h
                        else match =IX?, second
                                db 0DDh,0E3h
                        else match =IY?, second
                                db 0FDh,0E3h
                        else
                                err "incorrect second argument"
                        end match
                else match =AF?, first
                        match =AF'?, second
                                db 08h
                        else
                                err "incorrect second argument"
                        end match
                else match =DE?, first
                        match =HL?, second
                                db 0EBh
                        else
                                err "incorrect second argument"
                        end match
                else
                        err "incorrect first argument"
                end match
        end macro

        EX (SP),HL
        EX (SP),IX
        EX AF,AF'
        EX DE,HL

The "?" character appears in many places to mark the names as case-insensitive. All these occurrences could be removed to further simplify the example, at the cost of making the names case-sensitive.
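
Because of the "?" markers, and assuming the "EX" definition above, the same instructions can be written in any letter case:

        ex de,hl        ; the same byte 0EBh as "EX DE,HL"
        Ex (sp),Ix      ; 0DDh,0E3h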

When the set of possible values of an argument is larger but has some regularities, textual substitutions can be defined to replace some of the symbols with carefully chosen constructions that can then be recognized and parsed:

        A? equ [:111b:]
        B? equ [:000b:]
        C? equ [:001b:]
        D? equ [:010b:]
        E? equ [:011b:]
        H? equ [:100b:]
        L? equ [:101b:]

        macro INC? argument
                match [:r:], argument
                        db 100b + r shl 3
                else match (=HL?), argument
                        db 34h
                else match (=IX?+d), argument
                        db 0DDh,34h,d
                else match (=IY?+d), argument
                        db 0FDh,34h,d
                else
                        err "incorrect argument"
                end match
        end macro

        INC A
        INC B
        INC (HL)
        INC (IX+2)
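
For reference, here are the bytes that this definition would produce for a few more arguments, assuming the substitutions and the macroinstruction above:

        INC E           ; [:011b:] gives 100b + 011b shl 3 = 1Ch
        INC L           ; 100b + 101b shl 3 = 2Ch
        INC (IY+5)      ; 0FDh,34h,5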

This approach has a trait that may not always be desirable: it allows an expression like "[:0:]" to be used directly in an argument. It is possible to prevent the syntax from being exploited in such a way by using a prefix in the "match" construction:

        REG.A? equ [:111b:]
        REG.B? equ [:000b:]
        REG.C? equ [:001b:]
        REG.D? equ [:010b:]
        REG.E? equ [:011b:]
        REG.H? equ [:100b:]
        REG.L? equ [:101b:]

        macro INC? argument
                match [:r:], REG.argument
                        db 100b + r shl 3
                else match (=HL?), argument
                        db 34h
                else match (=IX?+d), argument
                        db 0DDh,34h,d
                else match (=IY?+d), argument
                        db 0FDh,34h,d
                else
                        err "incorrect argument"
                end match
        end macro
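
With this variant, and assuming the "REG." definitions above, a register name is still accepted, because "REG.C" is replaced with its bracketed code, while the bracketed form written directly no longer matches:

        INC C           ; REG.C becomes [:001b:], generating 0Ch
        ; INC [:001b:]  ; "REG.[:001b:]" matches no pattern: "incorrect argument"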

In the case of an argument structured like "(IX+d)" it may sometimes be desirable to allow other algebraically equivalent forms of the expression, like "(d+IX)" or "(c+IX+d)". Instead of parsing every possible variant individually, it is possible to let the assembler evaluate the expression while treating the selected symbol in a distinct way. When a symbol is declared as an "element", it has no value, and when it is used in an expression, it is treated algebraically like a variable term in a polynomial.

        element HL?
        element IX? 
        element IY? 

        macro INC? argument
                match [:r:], argument
                        db 100b + r shl 3
                else match (a), argument
                        if a eq HL
                                db 34h
                        else if a relativeto IX
                                db 0DDh,34h,a-IX
                        else if a relativeto IY
                                db 0FDh,34h,a-IY
                        else
                                err "incorrect argument"
                        end if
                else
                        err "incorrect argument"
                end match
        end macro

        INC (3*8+IX+1)

        virtual at IX
                x db ?
                y db ?
        end virtual        

        INC (y)
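
In the sample above, "INC (3*8+IX+1)" generates 0DDh,34h,19h and "INC (y)" generates 0DDh,34h,1. A few more forms that this version should accept, assuming the same "element" declarations, macroinstruction and "virtual" block (the specific operands are only illustrative):

        INC (IX+1+1)    ; 0DDh,34h,2
        INC (2+IY)      ; 0FDh,34h,2
        INC (x)         ; "x" was defined at IX+0: 0DDh,34h,0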

There is a small problem with the above macroinstruction. A parameter may contain any text, and when such a value is placed into an expression, it may induce erratic behavior. For example, if "INC (1|0)" was processed, it would turn the "a eq HL" expression into "1|0 eq HL", and this logical expression is correct and true even though the argument was malformed. Such an unfortunate side effect is a consequence of macroinstructions operating on a simple principle of text substitution (and the best way to avoid such problems is to use CALM instead). Here, to prevent it from happening, a local variable may be used as a proxy holding the value of an argument:

        macro INC? argument
                match [:r:], argument
                        db 100b + r shl 3
                else match (a), argument
                        local value
                        value = a
                        if value eq HL
                                db 34h
                        else if value relativeto IX
                                db 0DDh,34h,value-IX    ; use the proxy, not the raw text
                        else if value relativeto IY
                                db 0FDh,34h,value-IY
                        else
                                err "incorrect argument"
                        end if
                else
                        err "incorrect argument"
                end match
        end macro

There is an additional advantage of such a proxy variable, thanks to the fact that its value is computed before the macroinstruction begins to generate any output. When an expression contains a symbol like "$", it may give different values depending on where it is calculated, and the use of a proxy variable ensures that the value taken is the one obtained by evaluating the argument before generating the code of an instruction.

When the set of symbols allowed in expressions is larger, it is better to have a single construction to process an entire family of them. An "element" declaration may associate an additional value with a symbol, and this information can then be retrieved with the "metadata" operator applied to a linear polynomial that contains the given symbol as a variable. The following example is another variant of the previous macroinstruction that demonstrates the use of this feature:

        element register
        element A? : register + 111b
        element B? : register + 000b
        element C? : register + 001b
        element D? : register + 010b
        element E? : register + 011b
        element H? : register + 100b
        element L? : register + 101b

        element HL?
        element IX? 
        element IY? 

        macro INC? argument
                local value
                match (a), argument
                        value = a
                        if value eq HL
                                db 34h
                        else if value relativeto IX
                                db 0DDh,34h,value-IX    ; use the proxy, not the raw text
                        else if value relativeto IY
                                db 0FDh,34h,value-IY
                        else
                                err "incorrect argument"
                        end if
                else match any more, argument
                        err "incorrect argument"
                else
                        value = argument
                        if value eq value element 1 & value metadata 1 relativeto register
                                db 100b + (value metadata 1 - register) shl 3
                        else
                                err "incorrect argument"
                        end if
                end match
        end macro

The "any more" pattern is there to catch any argument that contains a complex expression consisting of more than one token. This prevents the use of syntax like "INC A+0" or "INC A+B-A". For some instruction sets, however, whether to include such a constraint may be a matter of personal preference.

The "value eq value element 1" condition ensures that the value does not contain any terms other than the name of a register. Even when an argument is forced to contain no more than a single token, it is still possible that it has a complex value, for instance if there were definitions like "X = A + B" or "Y = 2 * A". Both "INC X" and "INC Y" would then cause the operator "element 1" to return the value "A", which differs from the value checked in either case.
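
The operators can also be tried in isolation. The following sketch assumes the "element" declarations from the sample above and uses the "X" and "Y" definitions mentioned in the text; the "assert" lines are only an illustration of the checks that the macroinstruction performs:

        db A metadata 1 - register      ; 111b, the number associated with A

        X = A + B
        Y = 2 * A
        assert ~ X eq X element 1       ; X has an additional term
        assert ~ Y eq Y element 1       ; Y is a multiple of A, not A itself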

If an instruction takes a variable number of arguments, a simple way to recognize its various forms is to declare an argument with the "&" modifier, which passes the complete contents of the arguments to "match":

        element CC
        
        NZ? := CC + 000b
        Z?  := CC + 001b
        NC? := CC + 010b
        C?  := CC + 011b
        PO? := CC + 100b
        PE? := CC + 101b
        P?  := CC + 110b
        M?  := CC + 111b

        macro CALL? arguments&
                local cc,nn
                match condition =, target, arguments
                        cc = condition - CC
                        nn = target
                        db 0C4h + cc shl 3
                else
                        nn = arguments
                        db 0CDh                     
                end match
                dw nn
        end macro

        CALL 0
        CALL NC,2135h
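
In the sample above, "CALL 0" generates 0CDh,0,0 and "CALL NC,2135h" generates 0D4h,35h,21h. Assuming the same definitions, other conditions are encoded in the same way:

        CALL Z,1234h    ; 0C4h + 001b shl 3 = 0CCh, then 34h,12h
        CALL M,0        ; 0C4h + 111b shl 3 = 0FCh, then 0,0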

This approach also makes it possible to handle other, more difficult cases, such as when the arguments may contain commas or are delimited in different ways.

The CALM macros are capable of finer control over the process of parsing the arguments, and their variant of "match" has more options. Here is how the above macro could be rewritten:

        calminstruction CALL? target&
                local   condition
                match   condition:name =, target, target, :
                jno     unconditional
                check   defined condition & condition relativeto CC
                jno     error
                emit    1, 0C4h + (condition - CC) shl 3
                jump    address
            error:
                err     "unrecognized syntax"
            unconditional:
                emit    1, 0CDh
            address:
                emit    2, target
        end calminstruction

This time no proxies are necessary, because the text of the arguments is only evaluated when computing the expressions that contain them. Furthermore, "match" could be used to check for malformed expressions, just like it checks for a correct name of condition symbol in the example.

How are the labels processed?

A standard way of defining a label is to follow its name with ":" (this also acts like a line break, and any other command, including another label, may follow on the same line). Such a label simply defines a symbol with a value equal to the current address, which is initially zero and increases as bytes are added to the output.

In some variants of assembly language it may be desirable to allow a label to precede an instruction without an additional ":" in between. It is then necessary to create a labeled macroinstruction that, after defining a label, passes processing to the original macroinstruction with the same name:

        struc INC? argument
                .:
                INC argument
        end struc

        start   INC A
                INC B

This has to be done for every instruction that needs to allow this kind of syntax. A simple loop like the following one would suffice:

        iterate instruction, EX,INC,CALL
                struc instruction? argument
                        .: instruction argument
                end struc
        end iterate

Every built-in instruction that defines data already has the labeled variant.
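
For instance, data directives can be preceded by a name without any additional definitions (the names used here are only illustrative):

        message db 'Hello!'     ; defines "message" as a label of the string
        buffer  rb 16           ; defines "buffer" as a label of 16 reserved bytes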

By defining a labeled instruction that has "?" in place of name it is possible to intercept every line that starts with an identifier that is not a known instruction and is therefore assumed to be a label. The following one would allow a label without ":" to begin any line in the source text (it also handles the special cases so that labels followed with ":" or with "=" and a value would still work):

        struc ? tail&
                match :, tail 
                        .: 
                else match : instruction, tail
                        .: instruction
                else match == value, tail
                        . = value
                else 
                        .: tail
                end match 
        end struc
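
With this interceptor in place, and assuming that an "INC" macroinstruction like the ones shown earlier is defined, all of the following forms should work (the label names are hypothetical):

        start   INC A           ; "start" is not a known instruction: it becomes a label
        again:  INC B           ; a label followed with ":" still works
        count = 3               ; as does a definition with "="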

Obviously, there is no longer any need to define specific labeled macroinstructions when a global effect of this kind is applied. A variant should be chosen depending on the type of syntax that needs to be allowed.

Intercepting even the labels defined with ":" may become useful when the value of the current address requires some additional processing before being assigned to a label - for example when a processor uses addresses with a unit larger than a byte. The intercepting macroinstruction might then look like this:

        struc ? tail&
                match :, tail 
                        label . at $ shr 1
                else match : instruction, tail
                        label . at $ shr 1
                        instruction
                else
                        . tail
                end match
        end struc
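
As a small illustration, assuming the interceptor above and an output that starts at address zero, a label placed after two words would receive a word address rather than a byte address:

                dw 1111h,2222h  ; two words occupy byte addresses 0 to 3
        third:  dw 3333h        ; $ is 4 here, so "third" is defined as 2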

The value of the current address that is used to define labels may be altered with "org". If the labels need to be differentiated from absolute values, a symbol defined with "element" may be used to form an address:

        element CODEBASE
        org CODEBASE + 0

        macro CALL? argument
                local value
                value = argument
                if value relativeto CODEBASE
                        db 0CDh
                        dw value - CODEBASE
                else
                        err "incorrect argument"
                end if 
        end macro
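
A short usage sketch, assuming the definitions above (the label name is hypothetical):

        entry:
                CALL entry      ; "entry" equals CODEBASE + 0: generates 0CDh,0,0
        ;       CALL 1234h      ; a plain number is not relative to CODEBASE: "incorrect argument"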

To define labels in an address space that is not going to be reflected in the output, a "virtual" block should be declared. The following sample prepares macroinstructions "DATA" and "CODE" to switch between generating program instructions and data labels. Only the instruction codes would go to the output:

        element DATA
        DATA_OFFSET = 2000h
        element CODE
        CODE_OFFSET = 1000h

        macro DATA?
                _END
                virtual at DATA + DATA_OFFSET
        end macro

        macro CODE?
                _END
                org CODE + CODE_OFFSET
        end macro

        macro _END?
                if $ relativeto DATA
                        DATA_OFFSET = $ - DATA
                        end virtual
                else if $ relativeto CODE
                        CODE_OFFSET = $ - CODE
                end if
        end macro

        postpone
                _END
        end postpone

        CODE

The "postpone" block is used here to ensure that the "virtual" block always gets closed correctly, even if the source text ends with data definitions.

Within the environment prepared by the above sample, any instruction would be able to distinguish data labels from the ones defined within the program. For example, a branching instruction could be made to accept an argument that is either a label within the program or an absolute value, but to disallow any data label:

        macro CALL? argument
                local value
                value = argument
                if value relativeto CODE
                        db 0CDh
                        dw value - CODE
                else if value relativeto 0
                        db 0CDh
                        dw value
                else
                        err "incorrect argument"
                end if 
        end macro

        DATA

        variable db ?

        CODE

        routine:

In this context either "CALL routine" or "CALL 1000h" would be allowed, while "CALL variable" would not be.

When the labels have values that are not absolute numbers, it is possible to generate relocations for instructions that use them. A special "virtual" block may be used to store the offsets of values inside the program that need to be relocated when its base changes:

        virtual at 0
                Relocations::
                rw RELOCATION_COUNT
        end virtual

        RELOCATION_INDEX = 0

        postpone
                RELOCATION_COUNT := RELOCATION_INDEX                
        end postpone

        macro WORD? value
                if value relativeto CODE
                        store $ - CODE : 2 at Relocations : RELOCATION_INDEX shl 1
                        RELOCATION_INDEX = RELOCATION_INDEX + 1
                        dw value - CODE
                else
                        dw value
                end if
        end macro 

        macro CALL? argument
                local value
                value = argument
                if value relativeto CODE | value relativeto 0
                        db 0CDh
                        word value
                else
                        err "incorrect argument"
                end if 
        end macro 

The table of relocations that is created this way can then be accessed with "load". The following two lines could be used to put the table in its entirety somewhere in the output:

        load RELOCATIONS : RELOCATION_COUNT shl 1 from Relocations : 0
        dw RELOCATIONS

Here "load" reads the whole table into a single string, then "dw" writes it into the output (padded to a multiple of a word, though in this case the string never requires such padding).

For more complex types of relocations, additional modifiers may need to be employed. For example, if the upper and lower portions of an address needed to be stored in separate places (likely across two instructions) and relocated separately, the necessary modifiers could be implemented as follows:

        element MOD.HIGH
        element MOD.LOW

        HIGH? equ MOD.HIGH +
        LOW? equ MOD.LOW +

        macro BYTE? value
                if value relativeto MOD.HIGH + CODE
                        ; register HIGH relocation
                        db (value - MOD.HIGH - CODE) shr 8
                else if value relativeto MOD.LOW + CODE
                        ; register LOW relocation
                        db (value - MOD.LOW - CODE) and 0FFh
                else if value relativeto MOD.HIGH
                        db (value - MOD.HIGH) shr 8
                else if value relativeto MOD.LOW
                        db (value - MOD.LOW) and 0FFh
                else
                        db value
                end if
        end macro 

The commands that would register the relocations have been omitted for clarity. In this case not only the offset within the code but also some additional information would need to be registered in appropriate structures. With such preparation, relocatable units in code might be generated like this:

        BYTE HIGH address
        BYTE LOW address

Such an approach makes it easy to enable syntax with modifiers in any instruction that internally uses the "byte" macroinstruction when generating code.
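
The branches that do not involve relocation can be tried with plain numbers. This sketch assumes the "BYTE" macroinstruction and modifiers above, together with the "CODE" element from the earlier samples:

        BYTE HIGH 1234h         ; no CODE term, nothing to relocate: 12h
        BYTE LOW 1234h          ; 34h
        BYTE 56h                ; no modifier: the value is stored unchanged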

How can multiple sections of a file be generated in parallel?

This assembly engine has a single main output that has to be generated sequentially. This may seem problematic when the file needs to contain distinct sections for code and data, collected from interleaved pieces that may be spread across multiple source files. There are, however, a couple of methods to handle it, all based in one way or another on forward-referencing capabilities of the assembler.

A natural approach is to define the contents of an auxiliary section in a "virtual" block and copy it to the appropriate position in the output with a single operation. When a "virtual" block is labeled, it can be re-opened multiple times to append more data to it.

                include '8086.inc'
                org     100h
                jmp     CodeSection

        DataSection:

                virtual
                        Data::
                end virtual

                postpone
                        virtual Data
                                load Data.OctetString : $ - $$ from $$
                        end virtual
                end postpone

                db Data.OctetString

        CodeSection:

                virtual Data
                        Hello db "Hello!",24h
                end virtual

                mov     ah,9
                mov     dx,Hello
                int     21h

                virtual Data
                        ExitCode db 37h
                end virtual

                mov     ah,4Ch
                mov     al,[ExitCode]
                int     21h

This leads to a relatively simple syntax even without help of any additional macros.

Another method could be to put the pieces of the section into macros and execute them all at the required position in the source. A disadvantage of such an approach is that tracing errors in definitions might become a bit cumbersome.

The techniques that make it easy to append to a section generated in parallel can also be very useful for generating data structures like relocation tables. Instead of the "store" commands used earlier when demonstrating the concept, regular data directives could be used inside a re-opened "virtual" block to create relocation records.
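
A minimal sketch of that idea, assuming the "CODE" element from the earlier samples, might look as follows. The name "position" is a hypothetical helper, and this is only one possible arrangement, not a complete relocation scheme:

        virtual at 0
                Relocations::
        end virtual

        macro WORD? value
                if value relativeto CODE
                        local position
                        position = $ - CODE     ; taken before "virtual" changes "$"
                        virtual Relocations
                                dw position     ; append one record to the table
                        end virtual
                        dw value - CODE
                else
                        dw value
                end if
        end macro

The finished table could then be read with "load" from within a "postpone" block, in the same way as the contents of the "Data" area are read in the sample above.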

What options are there to parse other kinds of syntax?

In some cases a command that the assembler needs to parse may begin with something other than the name of an instruction or a label. It may be that a name is preceded by a special character, like "." or "!", or that it is an entirely different kind of construction. It is then necessary to use "macro ?" to intercept whole lines of source text and process any special syntax of such kind.

For example, if it was required to allow a command written as ".CODE", it would not be possible to implement it directly as a macroinstruction, because the initial dot causes the symbol to be interpreted as a local one, and a globally defined instruction could never be executed this way. The intercepting macroinstruction provides a solution:

        macro ? line&
                match .=CODE?, line
                        CODE
                else match .=DATA?, line
                        DATA
                else
                        line
                end match
        end macro  

The lines that contain either ".CODE" or ".DATA" are processed here in such a way that they invoke the global macroinstruction with the corresponding name, while all other intercepted lines are executed without changes. This method makes it possible to filter out any special syntax and let the assembler process the regular instructions as usual.
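
Assuming the "CODE" and "DATA" macroinstructions from the earlier sample are defined, the source text could then switch between the two areas like this (the label names are only illustrative):

        .DATA
        variable db ?

        .CODE
        routine: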

Sometimes unconventional syntax is expected only in a specific area of source text, like inside a block with defined boundaries. The parsing macroinstruction should then be applied only in this place, and removed with "purge" when the block ends:

        macro concise
                macro ? line&
                        match =end =concise, line
                                purge ?
                        else match dest+==src, line
                                ADD dest,src
                        else match dest-==src, line
                                SUB dest,src
                        else match dest==src, line
                                LD dest,src
                        else match dest++, line
                                INC dest
                        else match dest--, line
                                DEC dest
                        else match any, line
                                err "syntax error"
                        end match
                end macro
        end macro

        concise
                C=0
                B++
                A+=2
        end concise

A macroinstruction defined this way does not intercept lines that contain directives controlling the flow of the assembly, like "if" or "repeat", and they can still be used freely inside such a block. This would change if the declaration was in the form "macro ?! line&". Such a variant would intercept every line with no exception.
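
For example, a conditional block could be placed inside "concise". This sketch assumes that an "INC" macroinstruction is defined; "DEBUG" is a hypothetical symbol introduced only for illustration:

        DEBUG = 1               ; defined outside the block, where "=" has its usual meaning

        concise
                if DEBUG
                        A++     ; translated to INC A
                end if
        end concise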

Another option to catch special commands might be to use "struc ?" to intercept only lines that do not start with a known instruction (the initial symbol is then treated as label). Since this one only tests unknown commands, it should cause less overhead on the assembly:

        struc (head) ? tail&
                match .=CODE?, head
                        CODE tail
                else
                        head tail
                end match
        end struc

All these approaches hide a subtle trap. A label defined with ":" may be followed by another instruction in the same line. If that next instruction (which here becomes hidden in the "tail" parameter) is a control directive like "if", putting it inside the "else" clause is going to cause broken nesting of control blocks. A possible solution is to somehow invoke "tail" contents outside of "match" block. One way could be to call a special macro:

        struc (head) ? tail&
                local invoker
                match .=CODE?, head
                        macro invoker
                                CODE tail
                        end macro
                else
                        macro invoker
                                head tail
                        end macro
                end match
                invoker
        end struc

A simpler option is to call the original line directly and, when an override is needed, cause it to be ignored with the help of another line interceptor (which disposes of itself immediately afterwards):

        struc (head) ? tail&
                match .=CODE?, head
                        CODE tail
                        macro ? line&
                                purge ?
                        end macro
                end match
                head tail
        end struc

However, a much better way of avoiding these kinds of pitfalls is to use CALM instructions instead of standard macros. There it is possible to process arguments and assemble the original or modified line without the use of any control directives. CALM instructions also offer much better performance, which might be especially important in the case of interceptors that get called for nearly every line in the source text.
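
As an illustration of this advice, the "struc ?" interceptor from the earlier samples might be rewritten in CALM roughly as follows. This is only a sketch: it assumes that "calminstruction" accepts the same "(head) ? tail&" form as "struc", and that an instruction named "CODE" has been defined elsewhere. A conditional jump takes the place of the "match"/"else" block, so no control directive surrounds the line that gets assembled:

        calminstruction (head) ? tail&
                match .=CODE?, head             ; is the label exactly ".CODE"?
                jno original
                arrange head, =CODE tail        ; build the replacement line
                assemble head
                exit
            original:
                arrange head, head tail         ; rebuild the intercepted line unchanged
                assemble head
        end calminstruction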

How to control recognition context for symbol identifiers?

When designing a macro for general use, it is important to ensure that it works correctly wherever it is called. While the "local" directive makes it possible to create private symbols for every instance of a macro without interfering with an unknown environment, sometimes a variable may need to be shared across multiple macro calls. A globally defined symbol usually works well enough for this purpose, but it is something that could be interfered with. Consider the following example:

        GLOBAL_STATE = 0

        macro state 
                db GLOBAL_STATE
        end macro

        macro switch value
                GLOBAL_STATE = GLOBAL_STATE xor (value)
        end macro

This solution has a few issues. First, if a symbol called "GLOBAL_STATE" was also used anywhere for different purposes, it would interfere with these macros. Choosing a rare and specific name for a global may help a bit (and if there is a need for multiple global symbols, it is better to create a dedicated namespace for them, preferably with an unusual name), but it would be better to separate these symbols somehow.

Another problem would manifest if the above macros were used inside any "namespace" block, because then a definition of "GLOBAL_STATE" would instead create a symbol in the other namespace. There is one basic solution to the second problem: it is enough to add a dot at the end of the identifier to ensure that the defined symbol is the same one whose value is accessed on the right-hand side:

        GLOBAL_STATE. = GLOBAL_STATE xor (value)
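
Put together, the corrected macro could look like the sketch below. The namespace name "Program" is only an illustration; the point is that the assignment made inside it should still update the global symbol:

        GLOBAL_STATE = 0

        macro switch value
                GLOBAL_STATE. = GLOBAL_STATE xor (value)
        end macro

        namespace Program
                switch 8                ; redefines the global GLOBAL_STATE
        end namespace

        db GLOBAL_STATE                 ; by the rule above, the updated value (8)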

There is, however, another option. The arguments given to a macro preserve the context for any identifiers in their text (unless specifically told not to do so, when the name of an argument is preceded by "&"). The following example demonstrates the effect:

        macro tester arg
                namespace X
                        a = 0
                        db a
                        db arg
                end namespace
        end macro

        a = 3
        tester a

Both data definitions that get assembled have the same text: "db a", but the second one interprets "a" in the context in which the macro was called and generates byte 3 instead of 0. This is achieved through a mechanism that could be interpreted as a kind of text coloring. The value of parameter "arg" has an additional property, like a color of text, that makes the first "db a" differ from the second one. Now, if we define a macro inside another macro, the text of that inner macro is going to preserve the coloring. The following example uses this mechanism to improve the definition of the "state"/"switch" macros from the earlier sample:

        macro setup variable

                variable = 0

                macro state
                        db variable
                end macro

                macro switch value
                        variable = variable xor (value)
                end macro

        end macro

        setup GLOBAL_STATE

        namespace Program

                switch 8

                GLOBAL_STATE = 0

                state

        end namespace

When the "setup GLOBAL_STATE" gets assembled, the "switch" macro is defined with a body containing the text:

        GLOBAL_STATE = GLOBAL_STATE xor (value)

where both instances of "GLOBAL_STATE" have the global context attached to them. Other parts of text have no such additional information associated. However, when "switch 8" is assembled, the line generated by macro is:

        GLOBAL_STATE = GLOBAL_STATE xor (8)

and this time there is another colored piece of text, the value of "8" carries the context in which the "switch" macro was called.

The above variant of "switch" and "state" always uses the global variable, no matter where they are called. This bears a similarity to the concept of closure present in many programming languages.

The variable is still global, though, and other code could still interfere with it. To make the variable completely out of reach of others, we can use the "local" directive, which creates a special parameter that replaces its name with the same text, but colored specifically to carry the context unique to the instance of a macro:

        macro setup

                local variable

                variable = 0

                macro state
                        db variable
                end macro

                macro switch value
                        variable = variable xor (value)
                end macro

        end macro

        setup
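
A brief usage sketch follows, assuming the "setup" definition above has been assembled. The global symbol named "variable" is hypothetical and serves only to show that the macros do not see it:

        variable = 100          ; unrelated global symbol, not touched by the macros

        switch 3                ; private variable becomes 0 xor 3 = 3
        switch 5                ; 3 xor 5 = 6
        state                   ; db 6
        db variable             ; db 100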

Another place where this kind of context transfer occurs is any value of a symbolic variable. When a symbol is defined with "equ" or "define", its entire value becomes marked with the context that was present at the time of definition. When the text of such value is assigned to a parameter with "match" or "irpv", the context information carried by the snippets of text is preserved:

        First:
                .x = 1

        define LIST .x

        Second:
                .x = 2

        LIST equ LIST, .x

        match values, LIST
                display `values
                db values
        end match

The "display" shows that the text extracted from "LIST" variable with "match" is ".x, .x". But the texts of these two identifiers have different contexts, and it turns out the list contains two different values.

When a "match" cuts the text of an identifier into many parts, all parts preserve the context information of the original text. And if an identifier is put together from parts of text that come from different sources, only the context associated with the initial part has any effect on the recognition of the symbol. This makes it possible to force any symbol to be recognized in the current context by prepending a "#" to it:

        macro tester name
                namespace my
                        name db ?       ; symbol defined in its original namespace
                        #name db ?      ; symbol defined in "my" namespace
                end namespace
        end macro 
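
For instance, a call made from the global namespace might be used as sketched below. The symbol name "buffer" is arbitrary and chosen only for this illustration:

        tester buffer

        dw buffer               ; address of the first reserved byte (global symbol)
        dw my.buffer            ; address of the second reserved byte (symbol in "my")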

If such a trick is not enough, and the text needs to be stripped of any context information it might have, the "rawmatch" directive does exactly that. A similar effect could be achieved by converting the value of a parameter into a string with the ` operator and then interpreting such pure text with "eval".

An argument to a macro that is prefixed with "&" does not add any context information to its value, but it also does not remove any such information if it is already carried by the given text or some parts of it. This allows text to be passed safely without losing any information, but also without contaminating it with unwanted context, as demonstrated by this multi-stage example:

        calminstruction (var) transparent_equ &val&
                publish :var, val
        end calminstruction

        namespace Windows
                EOL := string 0x0A0D
                link equ EOL
        end namespace

        EOL := 0x0A

        match WinEOL, Windows.link
                list    equ             WinEOL, EOL     ; added context
                list    transparent_equ WinEOL, EOL     ; no context added
        end match

        namespace C64
                EOL := 0x0D
                irpv items, list
                        db items
                end irpv
        end namespace

All the items passed to "db" contain the text "EOL", but interpreted differently: within context of namespace "Windows", of a global namespace, and finally as a raw text without context (in which case it ends up pointing to the nearest definition, the one in the namespace of "C64").

How to define an instruction sharing a name with one of the core directives?

It may happen that a language can in general be easily implemented with macros, but it needs to include a command with the same name as one of the directives of the assembler. While it is possible to override any instruction with a macro, the macros themselves may require access to the original directive. To allow the same name to call a different instruction depending on the context, the implemented language may be interpreted within a namespace that contains the overriding macro, while all the macros requiring access to the original directive would have to temporarily switch to another namespace where it has not been overridden. This would require every such macro to pack its contents in a "namespace" block.

But there is another trick, related to how texts of macro parameters or symbolic variables preserve the context under which the symbols within them should be interpreted (this includes the base namespace and the parent label for symbols starting with dot).

Unlike the two mentioned occurrences, the text of a macro normally does not carry such extra information, but if a macro is constructed in such a way that it contains text that was once carried within a parameter to another macro or within a symbolic variable, then this text retains the information about context even when it becomes a part of a newly defined macro. For example:

        macro definitions end?
                namespace embedded
                struc LABEL? size
                        match , size
                                .:
                        else
                                label . : size
                        end match
                end struc
                macro E#ND? name
                        end namespace
                        match any, name
                                ENTRYPOINT := name
                        end match
                        macro ?! line&
                        end macro
                end macro
        end macro

        definitions end

        start LABEL
        END start

The parameter given to "definitions" macro may appear to do nothing, as it replaces every instance of "end" with exactly the same word - but the text that comes from the parameter is equipped with additional information about context, and this attribute is then preserved when the text becomes a part of a new macro. Thanks to that, macro "LABEL" can be used in a namespace where "end" instruction has taken a different meaning, but the instances of "end" within its body still refer to the symbol in the outer namespace.

In this example the parameter has been made case-insensitive, and thus it would replace even the "END" in "macro" statement that is supposed to define a symbol in "embedded" namespace. For this reason the identifier has been split with a concatenation operator to prevent it from being recognized as parameter. This would not be necessary if the parameter was case-sensitive (as more usual).

The same effect can be achieved through use of symbolic variables instead of macro parameters, with help of "match" to extract the text of a symbolic variable:

        define link end
        match end, link
                namespace embedded
                struc LABEL? size
                        match , size
                                .:
                        else
                                label . : size
                        end match
                end struc
                macro END? name
                        end namespace
                        match any, name
                                ENTRYPOINT := name
                        end match
                        macro ?! line&
                        end macro
                end macro
        end match

        start LABEL
        END start

This would not work without passing the text through symbolic variable, because parameters defined by control directives like "match" do not add context information to the text unless it was already there.

CALM instructions allow for another approach to this kind of problem. If a customized instruction set is defined entirely in the form of CALM instructions, they may not even need access to the original control directives. However, if a CALM instruction needs to assemble a directive that might not be accessible, the symbolic variable passed to "assemble" should be defined with the appropriate context for the instruction symbol.

How to convert a macroinstruction to CALM?

A classic macroinstruction consists of lines of text that are preprocessed (by replacing names of parameters with their corresponding values) every time the instruction is called, and these preprocessed lines are passed to assembly. For example, this macroinstruction generates just a single line to be assembled, and it does so by replacing "value" with the text given by the only argument to the instruction:

        macro octet value*
                db value
        end macro

A CALM instruction can be viewed as a customized preprocessor, which needs to be written in a special language. It is able to use various commands to process the arguments and generate lines to be assembled. On the basic level, it is also able to simulate what the standard preprocessor does, with the help of the "arrange" command. After preprocessing the line, it also needs to explicitly pass it to the assembly with an "assemble" command:

        calminstruction octet value*
                arrange value, =db value
                assemble value
        end calminstruction

This gives the same result as the original macroinstruction, as it performs the same kind of preprocessing. However, unlike the text of a macroinstruction, a pattern given to "arrange" needs to explicitly state which name tokens are to be replaced with their values and which ones (prepended with "=") should be left untouched. The tokens that are copied from the pattern are stripped of any context information, just like the text of a macroinstruction normally does not carry any (while the values that came from arguments retain the recognition context in which the instruction was started).

This is the most straightforward method of conversion, and a simple sequence of "arrange" and "assemble" commands could be made to generate the same lines as the original macroinstruction. But there is one exception: when a "local" command is executed by a macroinstruction, it creates a preprocessed parameter with a special value that points to a symbol in the namespace unique to the given instance of the instruction.

        macro pointer
                local next
                dd next
            next:
        end macro
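
Each call therefore gets its own "next". In the following usage sketch the two calls do not conflict; the explicit "org 0" is added here only so that the commented values are unambiguous:

        org 0
        pointer                 ; dd 4, "next" is the address right after this dword
        pointer                 ; dd 8, a different "next" in a fresh namespace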

In the case of CALM there is no such namespace available; the local namespace of a CALM instruction is shared among all its instances. Therefore, if a new unique symbol is needed every time the instruction is called, it has to be constructed manually. An obvious method might be to append a unique number to the name:

        global_uid = 0

        calminstruction pointer
                compute global_uid, global_uid + 1
                local command
                arrange command, =dd =next#global_uid
                assemble command
                arrange command, =next#global_uid:
                assemble command
        end calminstruction

Here "arrange" is given a variable that has a numeric value and it has to replace it with a text. This works only when the value is a plain non-negative number; in such a case "arrange" converts it to a text token that contains the decimal representation of that number. The lines passed to assembly are therefore going to contain identifiers like "next#1".

While incrementation of the global counter could be done by preparing a standard assembly command like "global_uid = global_uid + 1" with "arrange" and passing it to assembly, "compute" command allows to do it directly in the CALM processor. Moreover, it is then not affected by anything that alters the context of assembly. If the instruction was defined as unconditional and used inside a skipped IF block, the "compute" would still perform its task, because execution of CALM commands is - just like standard preprocessing - done independently from the main flow of the assembly. Also, references to the "global_uid" always point to the same symbol - the one that was in scope when the CALM instruction was defined and compiled. Therefore incrementing the value with "compute" is more reliable and predictable.
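
This independence from the flow of the assembly can be seen in a small sketch. The names "ticks" and "tick" are made up for this illustration, and the "!" modifier is assumed to mark a CALM instruction as unconditional in the same way as it does for "macro":

        ticks = 0

        calminstruction tick!
                compute ticks, ticks + 1
        end calminstruction

        if 0
                tick            ; still executed, although the block is skipped
        end if

        db ticks                ; following the description above: 1, not 0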

In a similar manner, the assembly of line defining the label can be replaced with a "publish" command. Here the value of the label (which should be equal to the address after the line containing "dd" is assembled) needs to be computed first, because "publish" only performs the assignment of a value to the symbol:

        global_uid = 0

        calminstruction pointer
                compute global_uid, global_uid + 1
                local symbol, command
                arrange symbol, =next#global_uid
                arrange command, =dd symbol
                assemble command
                local address
                compute address, $
                publish symbol:, address
        end calminstruction 

Because the CALM instruction itself is conditional, the "publish" inside is effectively conditional, too. Therefore it works correctly as a replacement for the assembly of line with a label.

While a global counter has several advantages, it can be interfered with, so sometimes the use of a local counter might be preferable. However, the local namespace of a CALM instruction is normally not accessible from outside, so it is a bit harder to give an initial value to such a counter. One way could be to check whether the counter has already been initialized with some value, using the "take" command:

        calminstruction pointer
                local id
                take id, id
                jyes increment
                compute id, 0
            increment:
                compute id, id + 1
                local symbol, command
                arrange symbol, =next#id
                arrange command, =dd symbol
                assemble command
                local address
                compute address, $
                publish symbol:, address
        end calminstruction 

But this adds commands that are executed every time the instruction is called. A better solution makes use of the ability to define custom instructions processed during the definition of CALM instruction:

        calminstruction calminstruction?.init? var*, val:0
                compute val, val
                publish var, val
        end calminstruction

        calminstruction pointer
                local id
                init id, 0
                compute id, id + 1
                local symbol, command
                arrange symbol, =next#id
                arrange command, =dd symbol
                assemble command
                local address
                compute address, $
                publish symbol:, address
        end calminstruction 

The custom statement "init" is called at the time when the CALM instruction is defined (it does not generate any commands to be executed by the defined instruction - it would itself have to use "assemble" commands to generate statements to be compiled). It is given the name of variable from the local scope of the CALM instruction, and it uses "publish" to assign an initial numeric value to that variable.

To initialize a local variable with a symbolic value, an even simpler custom instruction would suffice:

        calminstruction calminstruction?.initsym? var*, val&
                publish var, val
        end calminstruction

The text of "val" argument carries the recognition context of the definition of CALM instruction that contains the "initsym" statement, therefore it allows to prepare a text for "assemble" containing references to local symbols:

        calminstruction bigendian32? value
                local command
                initsym command, dd value
                compute value, value bswap 4
                assemble command
        end calminstruction 

Again, after this instruction is compiled, it contains just two actual commands, "compute" and "assemble", and the value of local symbol "command" is a text that is interpreted in the same local context and refers to the same symbol "value" as the "compute" does.

This example also demonstrates another advantage of CALM over standard macroinstructions: its strict semantics prevent various kinds of unwanted behavior that is allowed by a simple substitution of text. The text of "value" is going to be evaluated by "compute" as a numeric sub-expression, signalling an error on any unexpected syntax. Therefore it should be favorable to process arguments entirely through CALM commands and only use "assemble" for final simple statements. And even these could be eliminated in cases where a CALM command exists to perform the same task:

        calminstruction bigendian32? value
                emit 4, value bswap 4
        end calminstruction 

This performs operations completely independently from the standard assembly process, so if the containing instruction was defined as unconditional, the output would be generated even if it was invoked inside a skipped "if" block or while defining a macroinstruction (and it would happen without adding any line to the definition).
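
Either version of "bigendian32" can be checked with a short usage sketch. The constant is arbitrary; since "bswap 4" reverses the four bytes before they are stored in the usual little-endian order, the most significant byte ends up first in the output:

        bigendian32 0x12345678          ; bytes 0x12, 0x34, 0x56, 0x78
        db 0x12, 0x34, 0x56, 0x78       ; the same four bytes written by hand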

Copyright © 1999-2025, Tomasz Grysztar.
