Building Compilers and Interpreters

Introduction

In the early days of programming, people would write programs in assembly code which would then be translated into the computer's numeric machine code. An assembler program could assemble machine instructions from mnemonics (like ADD for add, or MUL for multiply), along with names for memory locations that the assembler would assign numbers to. For example, if you (the programmer) wanted to compute a formula like
2 + 4 * 6

then a coder (perhaps someone else) might write a snippet like the following.

LDA  #4     ; load accumulator with number 4
MUL  #6     ; multiply by 6 leaving result in the accumulator
ADD  #2     ; add two to the product - result in accumulator

and the assembler program would translate this into machine code.

Of course, the comments starting with the ";" would be more sophisticated in practice, since anyone actually reading this code would know full well what LDA, MUL and ADD mean.

But if you wanted instead

(2 + 4) * 6

then the snippet would look like

LDA  #4     ; load accumulator with number 4
ADD  #2     ; add two
MUL  #6     ; multiply by 6 leaving result in the accumulator

to force the addition to take place first.
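These two orderings can be mimicked with a toy accumulator machine in a few lines of Python (an illustration of my own, not part of the project code):

```python
def run(program):
    """Run a list of (mnemonic, operand) pairs on a one-register machine."""
    acc = 0
    for op, n in program:
        if   op == 'LDA': acc = n        # load accumulator
        elif op == 'ADD': acc += n       # add to accumulator
        elif op == 'MUL': acc *= n       # multiply accumulator
    return acc

run([('LDA', 4), ('MUL', 6), ('ADD', 2)])   # 2 + 4*6  -> 26
run([('LDA', 4), ('ADD', 2), ('MUL', 6)])   # (2+4)*6  -> 36
```

Reordering the same three instructions changes which operation happens first, exactly as in the assembly snippets.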

The early language Fortran (for Formula Translation) seemed a little magical by the standards of the time, since it could do this (and more) automatically. Formulas (also called expressions) could be nested with parentheses to any depth. Automatically dealing with such expressions is most easily done in a language that supports recursive functions. Early versions of Fortran did not.

Pascal came along some years later and did support recursion, although a bit awkwardly. Suddenly it became fairly easy to write programs to handle recursive data structures.

Pascal had something else going for it: an interesting technique was used to formally describe its syntax. Boxes connected with labeled arrows represented states; by looking at the next item in the input stream, the parser could choose an arrow leading to a new box if the arrow's label matched that item. The boxes were formed into groups, each satisfying a syntactic element. An unlabeled arrow could leave a group successfully: entry from the left, exit from the right. Here is a tiny example for the group unsigned integer, which uses another group, digit.

unsigned integer -----> digit -------->
                 ^         |
                 |         | (0-9)
                 |_________|

which means any consecutive string of digits (0-9), with at least one digit, satisfies an unsigned integer group. The exit arrow to the right, indicating success, is taken only when no other arrow can be used. Being unable to follow any arrow signifies a syntax error in the input.

Ten pages of such "railway boxes" suffice to describe the syntax of the complete Pascal language.

It's not hard to translate such boxes directly into a Python program by simply converting each group into a function. Python, like all modern computer languages, fully supports recursion, so these boxes may use each other recursively as well.
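For instance, the unsigned integer group above might be rendered as a function like this (a sketch of my own, scanning a string s from position p):

```python
from string import digits

def unsigned_integer(s, p):
    """Traverse the 'unsigned integer' railroad group. Return the position
    after the integer, or None if no arrow can be followed."""
    if p >= len(s) or s[p] not in digits:
        return None                      # must pass through 'digit' at least once
    while p < len(s) and s[p] in digits:
        p += 1                           # loop back through 'digit'
    return p                             # unlabeled exit arrow: success

unsigned_integer("541+", 0)   # -> 3
```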

Our aim in this project is quite modest. We'll write functions that compute numeric formulas with add, subtract, multiply and divide operations. We'll write a program that emulates the "railroad boxes" and comes in three slightly different versions.

The first will check whether a formula is syntactically valid, the second will act as an interpreter to evaluate the formula, and the third will be a compiler that translates the formula into something simpler with the same meaning (semantics).

The input formula is a character stream that will first be transformed into a token stream. Tokens are numbers, words and operators.

The token stream is parsed by our program from left to right looking ahead at just the next token and deciding what to do based on where it is and what the token is.

As blocks are completed, there is a return to the calling block (reduction) which may produce interpretation or translation.

Lexical Analysis

Our first task is to convert an input formula from a character stream into a token stream that is more compact and easier to deal with. This is best shown with an example.

"541 + (37-2)"                     # character stream (input program)

541 + ( 37 - 2 )                   # tokens separated by spaces

[541, '+', '(' ,37, '-', 2, ')']   # tokens as a Python list

When reduced to a Python list, integers are evaluated and become single tokens. Operators, keywords and variable names are likewise reduced to individual strings. Although not part of our lexer, extra whitespace and comments would also be discarded at this stage.
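As an aside, this whole transformation can be sketched with a single regular expression (a variation of my own; the project instead uses the hand-written lexer below):

```python
import re

# numbers, then words, then two-character operators, then any single
# non-space character
TOKEN = re.compile(r'\d+|[A-Za-z_]\w*|\*\*|<=|==|&&|\|\||\S')

def tokenize(s):
    return [int(t) if t.isdigit() else t for t in TOKEN.findall(s)]

tokenize("541 + (37-2)")    # -> [541, '+', '(', 37, '-', 2, ')']
```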

Here is most of the code from lex.py.

01 # lex.py
02
03 from string import ascii_letters, digits, whitespace
04
05 wordStart = ascii_letters + "_"
06 wordInner = wordStart + digits
07 dualOps   = ('**','<=','==','&&','||')
08
09
10 def getDigit(p,prog) :
11     if prog[p] in digits : return (p+1,prog[p])
12     else                 : return (p,None)
13
14 def getNumber(p,prog) :
15     p,dig = getDigit(p,prog)
16     if dig == None : return (p,None)  # no leading digit
17     number = dig
18     while True :
19         p,dig = getDigit(p,prog)
20         if dig == None : return (p,int(number))
21         number += dig         # gather ascii digits
22
23 def getWord(p,prog) :
24     if prog[p] in wordStart :  # Python: alphabetic or "_"
25         word = prog[p]
26         p += 1
27         while prog[p] in wordInner :  # alphanumeric
28             word += prog[p]
29             p += 1
30         return (p,word)
31     return (p,None)
32
33 def lexScan(prog) :
34     prog += '$'     # eof marker
35     tokens = []
36     p = 0
37     while True :
38         while prog[p] in whitespace : p += 1
39         if (prog[p] == '$') : return tokens
40         p,word = getWord(p,prog)
41         if word : tokens.append(word)
42         else :
43             p,number = getNumber(p,prog)
44             if number != None : tokens.append(number)
45             elif prog[p:p+2] in dualOps :
46                 tokens.append(prog[p:p+2])  # "**" "<=" etc
47                 p = p+2
48             else :
49                 tokens.append(prog[p])
50                 p = p+1
51     return tokens

The functions getDigit, getNumber and getWord are passed the program (a character string) and a pointer p showing where to start. If a function is successful, it returns a value along with p updated to the next point in the program. If not, it returns None and the pointer unchanged. See lines 16 and 20.

Words can be language keywords or names of variables, constants or functions. These strings may contain letters, digits or the underscore character, but may not begin with a digit.

At line 43, if neither a word nor a number is found, the lexical scanner assumes that the character at the pointer, and perhaps the following character as well, form an operator. The dual-character operators are defined in a lookup at line 7.

There is no code here for stripping comments, and the only literal values handled are positive integers. Here is an example run.

$ python lex.py "2 * pi*radius**2"
2 * pi*radius**2
[2, '*', 'pi', '*', 'radius', '**', 2]

Syntax Checking a Token Stream

Following the model of the Pascal syntax diagrams, we convert diagrams into functions that gather sub-diagrams as needed. If a syntax error occurs, our interpreter or compiler will simply stop and point to the current location of the error, much as Python does.

Usually, such diagram functions would simply call each other recursively, but I want to take another approach. A class Calc contains a method for each diagram, and an instance of Calc processes tokens to traverse its diagram to the exit. A given method can spawn more Calc instances as helpers.

During the computation there is a chain of Calc instances that grows and shrinks as the interpretation or compilation of the input tokens proceeds. This is equivalent to recursive functions but perhaps easier to visualize.

The Calc class

The Calc class has methods for the following:

getExpr - calculate the value of a sum of terms

getTerm - calculate the value of a product of units

getUnit - return the value of an integer, an expression in parentheses, or a negated unit

An expression consists of terms separated by "+" or "-" operators, indicating addition or subtraction. A term consists of units separated by the operators "*" or "/", for multiplication or division. A unit, to begin with, is simply an unsigned integer.

So in the expression "45+14*2" the units are "45", "14" and "2". The terms are "45" and "14*2".
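For contrast with the Calc approach described above, the same three-level grammar can be written as three plain recursive functions over the token list (a sketch of my own; a '$' end marker is assumed, and "/" does integer division as Python 2's "/" did):

```python
# expr := term (('+'|'-') term)*
def expr(toks, p):
    p, v = term(toks, p)
    while toks[p] in ('+', '-'):
        op = toks[p]
        p, w = term(toks, p + 1)
        v = v + w if op == '+' else v - w
    return p, v

# term := unit (('*'|'/') unit)*
def term(toks, p):
    p, v = unit(toks, p)
    while toks[p] in ('*', '/'):
        op = toks[p]
        p, w = unit(toks, p + 1)
        v = v * w if op == '*' else v // w   # integer division
    return p, v

# unit := integer | '-' unit | '(' expr ')'
def unit(toks, p):
    t = toks[p]
    if isinstance(t, int):
        return p + 1, t
    if t == '-':
        p, v = unit(toks, p + 1)
        return p, -v
    if t == '(':
        p, v = expr(toks, p + 1)
        if toks[p] != ')':
            raise SyntaxError("no closing paren")
        return p + 1, v
    raise SyntaxError("bad token: %r" % (t,))

expr([45, '+', 14, '*', 2, '$'], 0)   # -> (5, 73)
```

Each function returns the updated pointer and a value, just as the lexer functions do.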

01 # syntax.py
02
03 import sys, main
04
05 typeInt = type(5)
06
07 class Calc :
08     def __init__(self, pc, tokens) :
09         self.pc = pc
10         self.tokens = tokens
11         self.value = None
12         self.code  = []
13

Attributes for a Calc instance are the following. tokens is the output from the lexical scanner, basically the tokenized input program. pc is the program counter; it points to the next token for this instance to process and is updated as tokens are consumed. The attributes value and code will be discussed later.

14     def getExpr(self) :
15         calc = Calc(self.pc, self.tokens).getTerm()
16         if calc == None : return self.fatal("getExpr","no first term")
17         self.pc   = calc.pc
18         while True :
19             oper = self.getSpec(("+","-"))
20             if not oper : return self
21             calc = Calc(self.pc, self.tokens).getTerm()
22             if calc == None : return self.fatal("getExpr","no expected term")
23             self.pc = calc.pc
24

The method getExpr gathers terms separated by + and -. To succeed there must be at least one term (checked at line 16). Additional terms are gathered in the while loop. Notice how terms are handled by sub-instances of Calc at lines 15 and 21. A new Calc instance is spawned and immediately invoked to find a term in the input. If it succeeds, a reference to the new instance is returned; otherwise None is returned. On success the pc in the caller is updated. The new instance is discarded when getExpr returns.

25     def getTerm(self) :
26         calc = Calc(self.pc, self.tokens).getUnit()
27         if calc == None : return self.fatal("getTerm","no first unit")
28         self.pc   = calc.pc
29         while True :
30             oper = self.getSpec(("*","/"))
31             if not oper : return self
32             calc = Calc(self.pc, self.tokens).getUnit()
33             if calc == None : return self.fatal("getTerm","no expected unit")
34             self.pc = calc.pc
35

The method getTerm is almost a copy of getExpr, except that the operators are now multiply and divide and getTerm looks for units instead of terms. With the code structured like this, the multiplies and divides are done before the adds and subtracts, honoring the expected precedence.

We're going to look at getUnit in chunks. It's the most interesting method.

36     def getUnit(self) :
37         tok = self.tokens[self.pc]
38         if type(tok) == typeInt :
39             self.pc += 1
40             return self

Basic units are numbers, constants or variables. In today's code only unsigned numbers are used. A number is always a single token, already converted by lex.py (line 44).

41         elif self.getSpec(['-']) :
42             calc = Calc(self.pc, self.tokens).getUnit()
43             if calc == None : return self.fatal("getUnit","Invalid expr")
44             self.pc = calc.pc
45             return self

However, there are two more possibilities for what a unit can be. At line 41 a minus sign indicates a unary minus. The real unit follows and its value is negated. This also makes numbers with a leading minus work the way they should.

46         elif self.getSpec(['(']) :
47             calc = Calc(self.pc, self.tokens).getExpr()
48             if calc == None : return self.fatal("getUnit","Invalid sub-expr")
49             self.pc = calc.pc
50             if self.getSpec([')']) :
51                 return self
52             else :
53                 return self.fatal("getUnit","no closing paran")
54         else :
55             return self.fatal("getUnit","bad token")

Finally, and most interesting of all, is an expression enclosed in parentheses. This is itself a unit. The inner expression calculates a numeric value which is then used just as a number would be. And of course an expression in parentheses may contain further sub-expressions. This becomes a recursive calculation, but since the pieces are handled in separate Calc instances, it doesn't feel like recursion.

Finally, there are two methods that don't spawn additional instances. getSpec does lookahead, checking whether the next token is among the valid possibilities passed in. If it is, the token is returned and the pc advanced. If not, None is returned.

57     def getSpec(self, valid) :
58         spec = self.tokens[self.pc]
59         if spec in valid :
60             self.pc += 1
61             return spec
62         else : return None
63

The method fatal backs out of the program when there is a syntax error. Each nested Calc instance prints a program snippet starting at its pc and in turn fails for its caller.

64     def fatal(self, who, mesg="") :
65         words = map(lambda x: str(x), self.tokens[self.pc:])
66         code = " ".join(words[:10])
67         print "Syntax error:", who, mesg, "-->", code
68         return None
69
70 if __name__ == "__main__" : main.main(Calc)

The main function is in a separate module and is used for the syntax checker, the interpreter and the compiler - all three. A bit of upside-down coding, but it was fun to do. The reasons will become clear in a bit.

01 import lex, calculate, sys
02
03 def main(Calc) :
04     prog = " ".join(sys.argv[1:])
05     tokens = lex.lexScan(prog) + ['$']  # Add eof marker
06     print "Input tokens are:", tokens
07     calc = Calc(0, tokens).getExpr()
08     if not calc : print "Compilation failed"
09     else :
10         print "Input left over:", tokens[calc.pc:]
11
12         if calc.value != None :
13             print "Interpreter evaluates:",calc.value
14         elif calc.code :
15             print
16             value = calculate.calculate(calc.code)
17             print "Compiled code evaluates:",value
18         else :
19             print "Syntax check passed"

The input is gathered from the command line for easy testing. Tokens are extracted with the lexScan function and printed. The initial Calc instance is passed the input with pc set to 0 (the start). A special end-of-file code $ lets us see whether all the input was used up in syntax checking the expression.

Testing the Syntax Checker

Here are a couple of tests

$ python syntax.py "(5+4)*-3"
Input tokens are: ['(', 5, '+', 4, ')', '*', '-', 3, '$']
Input left over: ['$']
Syntax check passed

$ python syntax.py "(5+4b)*-3"
Input tokens are: ['(', 5, '+', 4, 'b', ')', '*', '-', 3, '$']
Syntax error: getUnit no closing paran --> b ) * - 3 $
Syntax error: getTerm no first unit --> ( 5 + 4 b ) * - 3 $
Syntax error: getExpr no first term --> ( 5 + 4 b ) * - 3 $
Compilation failed

In the second example a syntax error was deliberately inserted to show the cascading failure.

From Syntax Checker to Interpreter

We can modify the Calc class to compute the value of our input expression and print it to the screen. This can be done while the syntax is being scanned. Here are the changes to the getUnit method.

42     def getUnit(self) :
43         tok = self.tokens[self.pc]
44         if type(tok) == typeInt :
45             self.value = tok      ### New. set to integer
46             self.pc += 1
47             return self
48         elif self.getSpec(['-']) :
49             calc = Calc(self.pc, self.tokens).getUnit()
50             if calc == None : return self.fatal("getUnit","Invalid expr")
51             self.value = -calc.value  ### New
52             self.pc = calc.pc
53             return self
54         elif self.getSpec(['(']) :
55             calc = Calc(self.pc, self.tokens).getExpr()
56             if calc == None : return self.fatal("getUnit","Invalid sub-expr")
57             self.value = calc.value  ### New
58             self.pc = calc.pc
59             if self.getSpec([')']) :
60                 return self
61             else :
62                 return self.fatal("getUnit","no closing paran")
63         else :
64             return self.fatal("getUnit","bad token")

In the three lines 45, 51 and 57 you see self.value being updated. When a sub-calculation returns, the caller incorporates its value attribute into the larger whole.

Now here are the changes to the method getExpr

14     def getExpr(self) :
15         calc = Calc(self.pc, self.tokens).getTerm()
16         if calc == None : return self.fatal("getExpr","no first term")
17         self.value = calc.value ### New
18         self.pc   = calc.pc
19         while True :
20             oper = self.getSpec(("+","-"))
21             if not oper : return self
22             calc = Calc(self.pc, self.tokens).getTerm()
23             if calc == None : return self.fatal("getExpr","no expected term")
24             if oper == "+" : self.value += calc.value  ### New
25             if oper == "-" : self.value -= calc.value  ### New
26             self.pc = calc.pc

And the same changes are made to the method getTerm. That's it. Here is the code for interpret.py.
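For reference, here is a sketch of getTerm with those same changes applied, wrapped in a cut-down Calc that understands only integer units, so the pattern can run on its own ("//" stands in for Python 2's integer "/"):

```python
# A cut-down Calc, just enough to exercise getTerm by itself.
class Calc :
    def __init__(self, pc, tokens) :
        self.pc = pc
        self.tokens = tokens
        self.value = None

    def getSpec(self, valid) :
        spec = self.tokens[self.pc]
        if spec in valid :
            self.pc += 1
            return spec
        return None

    def getUnit(self) :
        tok = self.tokens[self.pc]
        if isinstance(tok, int) :
            self.value = tok              ### New. set to integer
            self.pc += 1
            return self
        return None

    def getTerm(self) :
        calc = Calc(self.pc, self.tokens).getUnit()
        if calc == None : return None
        self.value = calc.value           ### New
        self.pc    = calc.pc
        while True :
            oper = self.getSpec(("*","/"))
            if not oper : return self
            calc = Calc(self.pc, self.tokens).getUnit()
            if calc == None : return None
            if oper == "*" : self.value *= calc.value    ### New
            if oper == "/" : self.value //= calc.value   ### New
            self.pc = calc.pc

Calc(0, [3, '*', 4, '/', 2, '$']).getTerm().value   # -> 6
```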

And here is a sample run.

$ python interpret.py "(5+4)*-3"
Input tokens are: ['(', 5, '+', 4, ')', '*', '-', 3, '$']
Input left over: ['$']
Interpreter evaluates: -27

From Interpreter to Compiler

With a few modifications to the above modifications, we can turn interpret.py into compile.py. Instead of evaluating an expression, the compiler will generate "byte code" that can do the evaluation much more efficiently. This is very common: Python itself first compiles the source code in a .py file into byte code stored in a ".pyc" file.
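You can peek at Python's own byte code for an expression with the standard dis module. Variables are used here so the compiler's constant folding doesn't collapse the formula, and the exact opcode names vary by Python version:

```python
import dis

code = compile("a + b * c", "<expr>", "eval")
dis.dis(code)   # loads a, b and c, then a multiply, then an add
```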

Here is the compiler program with the changes arising from code generation. The methods getSpec and fatal are unchanged and not shown.

01 # compile.py
02
03 import sys, main
04
05 typeInt = type(5)
06
07 class Calc :
08     def __init__(self, pc, tokens) :
09         self.pc = pc
10         self.tokens = tokens
11         self.value  = None
12         self.code   = []
13
14     def getExpr(self) :
15         calc = Calc(self.pc, self.tokens).getTerm()
16         if calc == None : return self.fatal("getExpr","no first term")
17         self.code = calc.code
18         self.pc   = calc.pc
19         while True :
20             oper = self.getSpec(("+","-"))
21             if not oper : return self
22             calc = Calc(self.pc, self.tokens).getTerm()
23             if calc == None : return self.fatal("getExpr","no expected term")
24             self.code += calc.code
25             self.code.append(oper)
26             self.pc = calc.pc
27
28     def getTerm(self) :
29         calc = Calc(self.pc, self.tokens).getUnit()
30         if calc == None : return self.fatal("getTerm","no first unit")
31         self.code = calc.code
32         self.pc   = calc.pc
33         while True :
34             oper = self.getSpec(('*','/'))
35             if not oper : return self
36             calc = Calc(self.pc, self.tokens).getUnit()
37             if calc == None : return self.fatal("getTerm","no expected unit")
38             self.code += calc.code
39             self.code.append(oper)
40             self.pc = calc.pc
41
42     def getUnit(self) :
43         tok = self.tokens[self.pc]
44         if type(tok) == typeInt :
45             self.pc += 1
46             self.code.append(tok)
47             return self
48         elif self.getSpec(['-']) :
49             calc = Calc(self.pc, self.tokens).getUnit()
50             if calc == None : return self.fatal("getUnit","Invalid expr")
51             self.code += calc.code
52             self.code.append('neg')
53             self.pc = calc.pc
54             return self
55         elif self.getSpec(['(']) :
56             calc = Calc(self.pc, self.tokens).getExpr()
57             if calc == None : return self.fatal("getUnit","Invalid sub-expr")
58             self.code += calc.code
59             self.pc = calc.pc
60             if self.getSpec([')']) :
61                 return self
62             else :
63                 return self.fatal("getUnit","no closing paran")
64         else :
65             return self.fatal("getUnit","syntax error")

Each Calc instance has a list, code, which will contain the byte-code commands to perform its bit of the computation. When a sub-calculation returns, its code is appended to whatever is already there; if it's the first sub-calculation, the code can simply be copied. That is what is happening when you see self.code = calc.code or self.code += calc.code. Operator codes are appended to self.code as appropriate.

Let's run a couple of tests.

$ python interpret.py "-(4+7*2)*6"
Input tokens are: ['-', '(', 4, '+', 7, '*', 2, ')', '*', 6, '$']
Input left over: ['$']
Interpreter evaluates: -108

$ python compile.py "-(4+7*2)*6"
Input tokens are: ['-', '(', 4, '+', 7, '*', 2, ')', '*', 6, '$']
Input left over: ['$']

Byte-code: [4, 7, 2, '*', '+', 'neg', 6, '*']
Apply token: 4   [4]
Apply token: 7   [4, 7]
Apply token: 2   [4, 7, 2]
Apply token: *   [4, 14]
Apply token: +   [18]
Apply token: neg [-18]
Apply token: 6   [-18, 6]
Apply token: *   [-108]
Compiled code evaluates: -108

The byte code is a program in postfix format, much like the languages Forth and PostScript. Values are pushed onto a stack, and operators (like + - * / neg) pull value(s) from the stack, do the operation, and push the result back onto the stack. In the "Apply" lines you can see the token being applied and the state of the stack afterwards.

At the end of the run the stack should hold just one element, the answer.

Interpreting the Byte Code

But now we need a separate, and simpler, interpreter for the byte code. Here is the code for calculate.py. Tokens that are numbers are simply pushed onto the stack. Operators pull their arguments from the stack, do their operation and push back the result. Notice the special operator neg, which handles the unary minus.

01 def calculate(tokens) :
02     stack = []
03     print "Byte-code:", tokens
04
05     for token in tokens :
06         if token == '$' : break
07         if type(token) == type(5) : stack.append(token)
08         elif token == '+' :
09             b = stack.pop()
10             a = stack.pop()
11             stack.append(a+b)
12         elif token == '-' :
13             b = stack.pop()
14             a = stack.pop()
15             stack.append(a-b)
16         elif token == 'neg' :
17             a = stack.pop()
18             stack.append(-a)
19         elif token == '*' :
20             b = stack.pop()
21             a = stack.pop()
22             stack.append(a*b)
23         elif token == '/' :
24             b = stack.pop()
25             a = stack.pop()
26             stack.append(a/b)
27         print 'Apply token: %-3s'%token, stack
28     return stack[0]
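As a side note, the same dispatch can be written more compactly with a table of functions from the operator module (a variation of my own; floor division stands in for Python 2's integer "/"):

```python
import operator

BINOPS = { '+': operator.add, '-': operator.sub,
           '*': operator.mul, '/': operator.floordiv }  # integer division

def calculate(tokens) :
    stack = []
    for token in tokens :
        if token == '$' : break
        if isinstance(token, int) :
            stack.append(token)            # numbers are pushed
        elif token == 'neg' :
            stack.append(-stack.pop())     # unary minus
        else :
            b = stack.pop()
            a = stack.pop()
            stack.append(BINOPS[token](a, b))
    return stack[0]

calculate([4, 7, 2, '*', '+', 'neg', 6, '*'])   # -> -108
```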

In the Real World

This project is only a toy compiler, really just a learning tool to have a peek under the hood. But these mechanisms of searching for syntax and applying semantics (code generation or evaluation) are at the heart of every compiler.

Most compilers are created with a tool set. The most famous is the original Lex (lexical analysis) together with Yacc (Yet Another Compiler Compiler); Flex and Bison are open-source versions of the pair. If you are running Unix, you have access to them. To build a compiler for your language you supply a grammar (similar to the railway diagrams) and instructions for what to do when reductions are made (similar to our generating self.code). The output is a compiler program incorporating both. Both Lex/Yacc and Flex/Bison build the compiler as a C program.

The constructed compiler is then run on a program in your new language, and the output depends on the code you wrote to handle the reductions. This is most often a meta-code like our postfix above, but much more thorough. It has become common to output code for existing virtual machines like the JVM (Java Virtual Machine), making it possible to use existing system libraries. Clojure (Lisp on the JVM) and Scala (a functional language) are examples. Elixir is a functional language that compiles to BEAM code built for Erlang; BEAM is optimized for concurrency, a major feature of both Erlang and Elixir. Even the newer language Elm compiles to JavaScript to get full access to your browser.

All the Python code for this project is available in cfsm.zip.

Copyright © 2017 Chris Meyers