Assemblers and Assembly Language

In the last section we looked at a simulator program for the Mythical Machine. In this section and the next we'll see how software evolved to make programming computers much easier.

Click here for the Python code for the MM Assembler

Problems with Machine Language

Programming in machine language is tedious for some obvious reasons. One is that you have to keep track of the numerical operation codes and not choose the wrong one. In coding the little programs of the last section we cheated a bit: our format allowed us to put a comment after the address and assembled instruction, which the simulator simply ignored. Without those comments the code would be excruciating to follow.

Another problem occurs if we are assembling a jump instruction. We only know the target address if the jump is back to a previous instruction. If the jump is forward to an instruction yet to come then we probably don't know what the address will be. So we have to come back later to complete the jump instruction.

But here is the worst problem. Suppose our initial code contains errors that require more (or fewer) instructions to fix. Then some jump instructions may have the wrong target addresses when the code is fixed and other instructions get shifted in memory. A major headache.

One quick and dirty approach to this problem was to apply what was called a binary patch. Where the code change occurred, we replaced whatever instruction was presently there with a jump to some available memory. At that address we put the replaced instruction, any new instructions, and finally a jump back to the original spot. That saved other jumps around the patch from needing modification, but code like this very quickly becomes messy and unmanageable.
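The patching steps just described can be sketched in a few lines of Python. This is only an illustration, not part of the MM code: the function name binary_patch, the free address 200, and the extra instruction are made up for the example. The six-digit words and the jump opcode (10) are taken from the listing that appears later in this section.

```python
# Memory modelled as a dict of address -> six-digit instruction string.
# The words below are copied from the listing later in this section.
memory = {
    103: "112106",  # jz   r2,done
    104: "050002",  # add  r0,r2
    105: "100102",  # jmp  loop
    106: "000000",  # hlt
}

def binary_patch(memory, at, new_instructions, free):
    """Squeeze new_instructions in after address `at` without moving any code.

    `free` is assumed to be the start of a block of unused memory.
    """
    displaced = memory[at]
    memory[at] = "100%03d" % free              # jmp to the free block
    body = [displaced] + new_instructions      # replaced instruction goes first
    body.append("100%03d" % (at + 1))          # jmp back to the original spot
    for offset, word in enumerate(body):
        memory[free + offset] = word

# Hypothetical fix: one extra instruction (051002, "add r1,r2") after address 104.
binary_patch(memory, at=104, new_instructions=["051002"], free=200)
for address in sorted(memory):
    print(address, memory[address])
```

Nothing at 105 and beyond has moved, so the existing jump targets stay valid, but the program now hops out to address 200 and back, which is exactly the kind of tangle that becomes hard to manage after a few patches.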

Assembler Programs

The answer to these problems was to create "assembler" programs that let us represent the operation codes, registers, and addresses symbolically and let this program assemble the numeric instructions for us. When changes are made in the program, the assembler code is modified and the entire program re-assembled to machine code.

In assembly language the pieces of each instruction are represented in a way that is much more readable to humans. For example, "add r1,r2" could mean add register 2 to register 1. The assembler would combine the opcode (05) with the two register arguments to create the instruction 051002. Jump destinations and data addresses are determined by applying a label to an instruction or data point. This will be clearer with an example. Here is the assembly language version of our previous program to add a list of numbers together.

go    ld#  r0,0      register 0 will hold the sum, init it
      ld#  r1,nums   register 1 points to the next number to add
      ld#  r2,1      register 2 holds the constant one.
loop  ldi  r3,r1     get next number into register 3
      jz   r3,done   if it's zero we're finished
      add  r0,r3     otherwise add it to the sum
      add  r1,r2     add one to register one (next number to load)
      jmp  loop      go for the next one
done  hlt  00        all done. sum is in register 0
nums  123            the numbers to add
      234
      345
        0            end of the list

You probably already see what is going on here. The symbols "r0" through "r3" represent the general registers. The symbols "ld#", "ldi", "jz", "add", "jmp" and "hlt" are the operation codes we are using. Finally, the symbols "go", "loop", "done" and "nums" are labels arbitrarily chosen to represent memory addresses. We don't know what those memory addresses will be and we don't really care. The assembler program will figure that out for us.
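One way to picture the symbols is as entries in small lookup tables that the assembler consults when it builds each numeric instruction. The sketch below is an assumption-laden illustration, not the MM Assembler itself: the names OPCODES, REGISTERS and encode are invented here, the opcode numbers are read off the listing later in this section, and the layout (two digits of opcode, one of register, three of address) is inferred from that same listing.

```python
# Opcode numbers as they appear in the listing later in this section.
OPCODES = {"hlt": 0, "ld#": 3, "ldi": 4, "add": 5, "jmp": 10, "jz": 11}

# Register symbols simply stand for their numbers.
REGISTERS = {"r0": 0, "r1": 1, "r2": 2, "r3": 3}

def encode(opcode, register=0, address=0):
    """Pack the parts into a six-digit word: 2 opcode, 1 register, 3 address."""
    return "%02d%1d%03d" % (opcode, register, address)

# "add r1,r2": the second register number travels in the address field.
print(encode(OPCODES["add"], REGISTERS["r1"], REGISTERS["r2"]))  # 051002
```

The result matches the 051002 worked out by hand earlier for "add r1,r2".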

Each line has the information for the assembler to build a single machine instruction. One, two, or three fields may be followed by an optional comment.

In the first line we have the label "go" in the first field. This will create a symbol "go" that will contain the address of this instruction.

The second field is the operation code "ld#" (load number) and the third field "r0,0" provides the information to complete the instruction. Some instructions require both a register and an address argument, "jmp" requires only an address, and "hlt" needs no argument at all. The second field may instead contain a simple number, which lets us put data into the program.
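To make the field layout concrete, here is a rough sketch of how one source line might be split apart. It is not taken from the MM Assembler; the function name split_fields is invented, and it assumes that a label always starts in the first column, that lines without a label begin with a blank, and that "hlt" is the only opcode without an argument field.

```python
def split_fields(line):
    """Return (label, op, arg, comment) for one line of assembly source."""
    tokens = line.split()
    # A label is present only when the line starts in the first column.
    label = tokens.pop(0) if line[:1].strip() else None
    # The next field is an opcode or a plain number (a data word).
    op = tokens.pop(0) if tokens else None
    # Data words and "hlt" carry no argument field in this sketch.
    takes_arg = op is not None and not op.isdigit() and op != "hlt"
    arg = tokens.pop(0) if takes_arg and tokens else None
    # Whatever is left over is the optional comment.
    return label, op, arg, " ".join(tokens)

print(split_fields("go    ld#  r0,0      register 0 will hold the sum"))
print(split_fields("      jmp  loop      go for the next one"))
print(split_fields("nums  123            the numbers to add"))
```

Because "hlt" is treated as taking no argument, a line written as "hlt 00" would have the "00" fall into the comment, which is harmless here since the instruction assembles to zeros either way.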

An Assembler for MM in Python

To follow along, get the program code for the MM Assembler and print it or put it into another window.

The assembler works in two passes. The first pass (function pass1) steps the address counter pReg through the program to determine what address value needs to be assigned to each of our labels. These are stored in the dictionary "lookup" along with the register definitions.
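A first pass along these lines might look like the sketch below. It is a simplified stand-in rather than the actual pass1 from the MM Assembler: the starting address of 100 is taken from the listing later in this section, the registers r0 through r9 are an assumption based on the single register digit, and comment-only lines are assumed not to occur.

```python
SOURCE = """\
go    ld#  r0,0      init the sum
      ld#  r1,nums   point to the first number
      ld#  r2,1      the constant one
loop  ldi  r3,r1     get next number
      jz   r3,done   zero means finished
      add  r0,r3     add it to the sum
      add  r1,r2     step to the next number
      jmp  loop
done  hlt  00
nums  123
      234
      345
        0
"""

def pass1(lines, start=100):
    """Assign an address to every label; one word per instruction or data line."""
    lookup = {"r%d" % n: n for n in range(10)}   # register definitions
    address = start
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue                             # blank line
        if not line[0].isspace():
            lookup[tokens.pop(0)] = address      # label starts in column one
        if tokens:
            address += 1                         # something to assemble here
    return lookup

lookup = pass1(SOURCE.splitlines())
print([lookup[name] for name in ("go", "loop", "done", "nums")])  # [100, 103, 108, 109]
```

A line holding only a label records the label but does not advance the address, so the label ends up naming whatever instruction comes next.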

The second pass uses the opcode field and argument field to build instructions, substituting labels in the argument field with their numeric addresses.
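The second pass can be sketched in the same spirit. Again this is an illustration under assumptions, not the author's pass2: the helper names are invented, the label addresses are written out by hand in LOOKUP as a first pass would have produced them, and the opcode numbers and word layout are read off the listing later in this section.

```python
OPCODES = {"hlt": 0, "ld#": 3, "ldi": 4, "add": 5, "jmp": 10, "jz": 11}

# What a first pass would have produced for the program below (assumed here).
LOOKUP = {"r%d" % n: n for n in range(10)}
LOOKUP.update({"go": 100, "loop": 102, "done": 106, "nums": 107})

SOURCE = """\
go    ld#  r0,0      reg 0 will hold the sum
      ld#  r1,nums   reg 1 points to next num
loop  ldi  r2,r1     get next number into reg 2
      jz   r2,done   if it's zero we're done
      add  r0,r2     else add it to the sum
      jmp  loop      go for the next one
done  hlt            all done. sum is in reg 0
nums  123            the numbers to add
"""

def value(symbol, lookup):
    """A symbol is either a known name or a plain number."""
    return lookup[symbol] if symbol in lookup else int(symbol)

def assemble_line(line, lookup):
    """Return the six-digit word for one line, or None if it holds no code."""
    tokens = line.split()
    if tokens and not line[0].isspace():
        tokens.pop(0)                            # label was handled in pass 1
    if not tokens:
        return None
    op = tokens[0]
    if op.isdigit():
        return "%06d" % int(op)                  # data word
    args = tokens[1].split(",") if len(tokens) > 1 and op != "hlt" else []
    nums = [value(a, lookup) for a in args]
    register = nums[0] if len(nums) == 2 else 0  # one argument means address only
    address = nums[-1] if nums else 0
    return "%02d%1d%03d" % (OPCODES[op], register, address)

def pass2(lines, lookup, start=100):
    """Build listing lines: address, assembled word, then the original source."""
    listing, address = [], start
    for line in lines:
        word = assemble_line(line, lookup)
        if word is None:
            listing.append(" " * 13 + line)      # nothing assembled for this line
            continue
        listing.append("%d %s   %s" % (address, word, line))
        address += 1
    return listing

print("\n".join(pass2(SOURCE.splitlines(), LOOKUP)))
```

Each output line carries the address and assembled word in front of the untouched source text.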

Function main reads the entire program from standard input to a list of lines and then passes that list to functions pass1 and pass2.

Assemblers usually take an input file with contents like the above and produce two output files. One is the machine language in a form that the computer can load for execution. Generally it is combined with others that are "linked" to form a complete executable program. The other output is a listing file that would look like the following.

100 030000   go    ld#  r0,0      reg 0 will hold the sum
101 031107         ld#  r1,nums   reg 1 points to next num
102 042001   loop  ldi  r2,r1     get next number into reg 2
103 112106         jz   r2,done   if it's zero we're done
104 050002         add  r0,r2     else add it to the sum
105 100102         jmp  loop      go for the next one
106 000000   done  hlt            all done. sum is in reg 0
107 000123   nums  123            the numbers to add
108 000234         234
109 000345         345
110 000000           0            zero marks the end

We short-circuit the production of a machine language file by letting the simulator from the last section read the listing file directly. To the simulator, the assembly language text on each line is simply a comment.

It is ok to have a label alone on a line. This makes it possible to have several symbols equated to the same address. We will use this feature in the next section when we develop a tiny compiler for MM.

Other Considerations

There are a couple of things that should be pointed out. Our assembler creates code that is then loaded directly into our computer (or simulator) in an address space determined by the assembler. In real life, final addressing is actually determined later by another program called the "linker". The machine code is also designed to facilitate this. Instead of actual addresses in the instructions, it's more likely that an offset from the current instruction would be used, so that "jmp loop" would not use the address 102, but rather -3. This complicates other things, because now we would need two different instructions for "ld# r0,0" and "ld# r0,nums": in the first instance we really want the number zero, but in the second we want whatever the final address of "nums" will be.

However, there is a tremendous advantage to code like this. It may be loaded anywhere in memory, and the linker program may link together many separate modules to create a single executable. On modern systems it is even possible to load modules at runtime and link them together. In fact, this is exactly what you do whenever you import a module into your Python program that was written in the C language and compiled to machine code.
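The arithmetic behind the relative jump is small enough to check directly. In this sketch the function name relative_offset is made up, and the offset is measured from the jump instruction itself, as in the "-3" example above; real machines often measure from the following instruction instead.

```python
def relative_offset(instruction_address, target_address):
    """Offset to store in a jump so that it reaches target_address."""
    return target_address - instruction_address

# "jmp loop" sits at 105 and "loop" is at 102 in the listing above.
print(relative_offset(105, 102))  # -3

# Load the same code 400 words higher: both addresses move, the offset does not.
base_shift = 400
print(relative_offset(105 + base_shift, 102 + base_shift))  # -3
```

Since the stored offset is the same wherever the code lands, a jump encoded this way needs no fixing up when the module is loaded at a different address.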

You can download the zip file for this project here.

If you have comments or suggestions, you can email me.

* * *