Building Compilers and Interpreters


Introduction

The art of programming languages has undergone a wonderful evolution, one that continues even today.

In the early days of programming, people wrote programs in pseudo code which they would then translate by hand into the computer's numeric machine code. Later came assembler programs, which could assemble numeric machine instructions from mnemonics (like ADD for add, or MUL for multiply) along with names assigned to memory locations. The assembler kept track of the addresses so that transfer instructions could be properly built.

So for example if you (the programmer) wanted to compute a formula like

2 + 4 * 6

then a coder (perhaps someone else) might write a snippet like the following.

LDA  #4     ; load accumulator with number 4
MUL  #6     ; multiply by 6 leaving result in the accumulator
ADD  #2     ; add two to the product - result in accumulator

and an "assembler" program would translate this into machine code (just binary numbers).

Of course the comments starting with the ";" would be more sophisticated, since anyone actually able to read this code would know full well what LDA, MUL and ADD mean.

But if you wanted instead to calculate

(2 + 4) * 6

then the translation would need to look like

LDA  #4     ; load accumulator with number 4
ADD  #2     ; add two
MUL  #6     ; multiply by 6 leaving result in the accumulator

to force the addition to take place first.

The early language Fortran (for Formula Translation) seemed somewhat magical for its time since it could do this (and more) automatically. Formulas (also called expressions) could be nested with parentheses to any depth. They were recursive. Automatically dealing with such recursion is most easily done in a language that supports recursive functions. Early versions of Fortran did not have this feature, but other languages like Lisp and Snobol did.

Later, Algol and Pascal came along, supporting recursion and also providing good control structures. It became fairly easy to write programs to handle recursive expressions and also recursive data structures.

Our aim in this project is quite modest. We'll demonstrate a set of small programs with functions that compute numeric formulas with add, subtract, multiply and divide operations, and that also handle sub-expressions in parentheses. The input is a character string with an expression in the usual infix notation. The output will be its computed value.

The first program (the lexer) will transform the string into a token stream. Tokens are numbers, words and operators.

The second program (the interpreter) will parse the token stream and compute the result of the expression on the fly.

The third program (the compiler) will parse the token stream and output simpler code that is non-recursive and can in turn be interpreted by ...

The fourth program (the calculator), which is very fast, mostly because the recursion has disappeared.

Today, even languages we consider interpreted use the method of the third and fourth programs. The "simpler code", or "bytecode", is what we find in a .pyc file, first generated by Python and then run.

Lexical Analysis

The first task is to convert an input formula from a character stream into a token stream that is more compact and easier to deal with. This is best shown with an example.

"541 + (37-2)"                     # expression in infix (operators between arguments)

541 + ( 37 - 2 )                   # tokens separated by spaces

[541, '+', '(' ,37, '-', 2, ')']   # tokens as a Python list

In this token list, integers have been evaluated and made into single tokens. Operators, keywords and variable names are reduced to individual strings. Whitespace is discarded; an actual lexer would discard comments as well, though ours does not handle them.

You can download the code lexer.py

 1 # lexer.py
 2
 3 from string import ascii_letters, digits, whitespace
 4
 5 wordStart = ascii_letters + "_"
 6 wordInner = wordStart + digits
 7 dualOps   = ('**','<=','==','&&','||')
 8
 9
10 def getDigit(p,prog) :
11     if prog[p] in digits : return (p+1,prog[p])
12     else                 : return (p,None)
13
14 def getNumber(p,prog) :               # Gather digits into integer
15     p,dig = getDigit(p,prog)
16     if dig == None : return (p,None)  # no leading digit
17     number = dig
18     while True :
19         p,dig = getDigit(p,prog)
20         if dig == None : return (p,int(number))
21         number += dig         # gather ascii digits
22
23 def getWord(p,prog) :                 # Gather characters into words
24     if prog[p] in wordStart :
25         word = prog[p]
26         p += 1
27         while prog[p] in wordInner :
28             word += prog[p]
29             p += 1
30         return (p,word)
31     return (p,None)
32
33 def lexScan(prog) :
34     prog += '$'     # eof marker
35     tokens = []
36     p = 0
37     while True :
38         while prog[p] in whitespace : p += 1
39         if (prog[p] == '$') : return tokens
40         p,word = getWord(p,prog)
41         if word : tokens.append(word)
42         else :
43             p,number = getNumber(p,prog)
44             if number != None : tokens.append(number)
45             elif prog[p:p+2] in dualOps :
46                 tokens.append(prog[p:p+2])  # "**" "<=" etc
47                 p = p+2
48             else :
49                 tokens.append(prog[p])
50                 p = p+1
51     return tokens
52

Functions getDigit, getNumber and getWord are passed the program (a character string) and a pointer p showing where to start. If a function is successful, it returns a value along with p updated to the next point in the program. If not successful, it returns None and leaves the pointer unchanged. See lines 16 and 20.
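For example, these direct calls at the Python prompt illustrate the protocol (a sketch; note that lexScan appends a '$' sentinel before calling these helpers, so here we use strings that end in a non-word character):

>>> from lexer import getNumber, getWord
>>> getNumber(0, "541+")     # success: pointer advances past the digits
(3, 541)
>>> getWord(0, "541+")       # failure: None returned, pointer unchanged
(0, None)
>>> getWord(0, "pi*r")       # words may contain letters, digits, underscores
(2, 'pi')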

getWord extracts keywords and the names of variables, constants or functions. These strings may contain letters, digits or the underscore character, but may not begin with a digit.

At line 45, if neither a word nor a number is found, the lexical scanner assumes the character at the pointer, perhaps together with the following character, is a built-in operator. The dual-character operators are defined in a lookup tuple at line 7. We won't actually use (or support) any of these in our expression grammar.

To keep things simple there is no code here for stripping comments or for handling any values other than positive integers. Here is an example run.

$ python lexer.py "2 * pi*radius*radius"
[2, '*', 'pi', '*', 'radius', '*', 'radius']
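
Although our expression grammar won't use the dual-character operators, they do lex correctly. Assuming lexer.py's main simply prints the token list, as in the run above:

$ python lexer.py "x <= 2**3"
['x', '<=', 2, '**', 3]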

Structure of Program Parsing

There are three major functions in both our interpreter and compiler:

getExpr extracts expressions which are terms added or subtracted

getTerm extracts terms which are units multiplied or divided

getUnit extracts units which are integers or, ta-dah, expressions enclosed in parentheses

The last bit where sub-expressions become units makes for recursive structures.

We'll see that the getExpr and getTerm functions are very similar. However, together they impose the precedence of multiply/divide over add/subtract. In real languages there are usually many more precedence levels as well, for things such as exponentiation, the logical operators not/and/or, and even bitwise operators.
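For instance, an extra level for exponentiation could slot in between terms and units. Here is a minimal sketch using the getUnit, getOper and fatal helpers defined in the interpreter below (getFactor is our hypothetical name, not part of the programs that follow):

def getFactor(tokens) :                   # binds tighter than * and /
    value,rest = getUnit(tokens)
    if value == None : return None, rest
    while True :
        oper,rest = getOper(rest, ["**"])
        if not oper : return value,rest
        val2,rest = getUnit(rest)
        if val2 == None : fatal("getFactor","no following unit")
        value = value ** val2             # note: Python's ** is actually
                                          # right-associative; this loop is not

getTerm would then call getFactor wherever it now calls getUnit.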

The interpreter program

The lexer's output becomes the input for the interpreter. Here is the code.

You can download interpreter.py

 1 # interpreter.py
 2 #
 3 import sys
 4 typeInt = type(5)
 5
 6 def getExpr(tokens) :
 7     value,rest = getTerm(tokens)
 8     if value == None : return None, rest
 9     while True :
10         oper,rest = getOper(rest, ["+","-"])
11         if not oper : return value,rest   # good return
12         val2,rest = getTerm(rest)
13         if val2 == None : fatal("getExpr","no following unit")
14         if oper == "+" : value += val2
15         if oper == "-" : value -= val2
16

Function getExpr is called from main to start the interpretation going. It gathers a series of terms separated by + or - signs; it may also be a single term. The computed value is returned at line 11. The first term, at line 7, assigns an initial amount to value, and subsequent terms add to or subtract from that value at lines 14 and 15. Interpreting 7 - 2 - 3, for example, value starts at 7, drops to 5 after the first subtraction and to 2 after the second.

17 def getTerm(tokens) :
18     value,rest = getUnit(tokens)
19     if value == None : return None, rest
20     while True :
21         oper,rest = getOper(rest, ["*","/"])
22         if not oper : return value, rest   # good return
23         val2,rest = getUnit(rest)
24         if val2 == None : fatal("getTerm","no following unit")
25         if oper == "*" : value *= val2
26         if oper == "/" : value /= val2
27

Function getTerm is basically the same, except that add and subtract are replaced with multiply and divide.

28 def getUnit(tokens) :
29     if not tokens : return (None, [])
30     tok,rest = tokens[0], tokens[1:]
31     if type(tok) == typeInt : return tok,rest
32     elif tok == '-' :
33         unit,rest = getUnit(rest)
34         if unit == None : fatal("getUnit","Invalid expr")
35         else : return -unit,rest
36     elif tok == "(" :
37         expr,rest = getExpr(rest)
38         if expr == None : fatal("getUnit","Invalid sub-expr")
39         tok,rest = getOper(rest,[')'])
40         if tok : return (expr, rest)
41         else   : fatal("getUnit","no closing paren")
42     return (None, tokens)    # unknown token: fail without consuming it

Function getUnit is more interesting. If we are out of tokens (line 29), it returns a value of None and []. This indicates the input is exhausted. Line 42 does much the same for a token that fits none of the cases. Similar failure checks at lines 8 and 19 enable multiple returns up to the top level, with None indicating failure.

A unary minus may show up at line 32. The following unit is gathered and returned as its negated value.

If the next token is a "(" (line 36), then a sub-expression is evaluated and, assuming it's followed by a ")", the expression's value is returned as the unit.
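
A few direct calls (a sketch at the Python prompt, importing from interpreter.py) show the three cases:

>>> from interpreter import getUnit
>>> getUnit([5, '+', 4])             # a bare integer
(5, ['+', 4])
>>> getUnit(['-', 5, '+', 4])        # unary minus negates the following unit
(-5, ['+', 4])
>>> getUnit(['(', 5, '+', 4, ')'])   # a parenthesized sub-expression
(9, [])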

43 def getOper(tokens, valid) :
44     if tokens and tokens[0] in valid :
45         return tokens[0], tokens[1:]
46     else : return None, tokens
47
48 def fatal(who, mesg="", code="") :
49     print("fatal -- %s %s --> %s" % (who,mesg,code))
50     sys.exit()
51

Function getOper pops a token from tokens if it is in the valid list. If not, it leaves tokens intact and returns None.
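
For example (again a sketch at the prompt):

>>> getOper(['+', 4], ["+","-"])
('+', [4])
>>> getOper(['*', 4], ["+","-"])
(None, ['*', 4])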

Function fatal is similar in spirit to Python's syntax error handling, but simpler. A function name and message are output and the program exits. Warning: it does not catch all errors.

52 def main() :
53     import lexer
54     code   = sys.argv[1]
55     tokens = lexer.lexScan(code)
56     print("Lexer gives %s" % tokens)
57     value,rest = getExpr(tokens)
58     print("%s %s" % (value, rest))
59
60 if __name__ == "__main__" : main()
61

Function main gets the expression from the command line (enclose it in quotes), sends it to lexer.lexScan and then to getExpr.

Testing the Interpreter

Here are a couple of tests

$ python interpreter.py "(5+4)*-3"
Lexer gives ['(', 5, '+', 4, ')', '*', '-', 3]
-27 []

$ python interpreter.py "(5+4b)*-3"
Lexer gives ['(', 5, '+', 4, 'b', ')', '*', '-', 3]
fatal -- getUnit no closing paren

In the second example a syntax error was deliberately inserted to generate a failure.

The -27 above is the result of the interpretation. The [] indicates that no tokens remain.

From Interpreter to Compiler

With just a few modifications we can turn interpreter.py into compiler.py. Instead of evaluating an expression, the compiler will generate "bytecode" that can be evaluated much more efficiently. As discussed earlier, this is very common: Python first compiles the source code in a .py file into bytecode in a .pyc file.
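You can watch CPython do the same thing with its standard dis module. The exact opcodes vary between Python versions, so the output is omitted here, but the shape is always the same stack program: push a, push b, add, push c, negate, multiply.

$ python -c "import dis; dis.dis('(a+b)*-c')"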

Here is the compiler program with changes arising from the code generation. Functions getOper and fatal are unchanged and not shown.

You can download compiler.py.

 1 # compiler.py
 2 #
 3 import sys
 4 typeInt = type(5)
 5
 6 def getExpr(tokens) :
 7     code1,rest = getTerm(tokens)
 8     if code1 == None : return None, rest
 9     byteCode = code1
10     while True :
11         oper,rest = getOper(rest, ["+","-"])
12         if not oper : return byteCode,rest   # good return
13         code2,rest = getTerm(rest)
14         if code2 == None : fatal("getExpr","no following unit")
15         byteCode += code2 + (oper,)
16

If you compare function getExpr in interpreter.py with the above, you can see the few changes. In the interpreter the running value is passed back, and in lines 14-15 each new term is added to or subtracted from the value already accumulated. Here in compiler.py, at lines 7 and 13, bytecode is passed back and appended to the prior bytecode. For "5 + 4*3", for example, line 7 yields (5,), line 13 yields (4, 3, '*'), and line 15 concatenates them with the operator to give (5, 4, 3, '*', '+'). The bytecode becomes a tuple eventually containing the full expression in postfix.

17 def getTerm(tokens) :
18     code1,rest = getUnit(tokens)
19     if code1 == None : return None, rest
20     byteCode = code1
21     while True :
22         oper,rest = getOper(rest, ["*","/"])
23         if not oper : return byteCode, rest   # good return
24         code2,rest = getUnit(rest)
25         if code2 == None : fatal("getTerm","no following unit")
26         byteCode += code2 + (oper,)
27

Exactly the same modifications appear in getTerm, except that the operators are * and /.

28 def getUnit(tokens) :
29     if not tokens : return (None, [])
30     tok,rest = tokens[0], tokens[1:]
31     if type(tok) == typeInt : return (tok,), rest
32     elif tok == '-' :
33         unit,rest = getUnit(rest)
34         if unit == None : fatal("getUnit","Invalid expr")
35         else : return unit+("~",),rest   # unit is already a tuple
36     elif tok == "(" :
37         byteCode,rest = getExpr(rest)
38         if byteCode == None : fatal("getUnit","Invalid sub-byteCode")
39         tok,rest = getOper(rest,[')'])
40         if tok : return (byteCode, rest)
41         else   : fatal("getUnit","no closing paren")
42     return (None, tokens)    # unknown token: fail without consuming it

And in getUnit, at lines 31, 37 and 40, it's the same story: bytecode is returned instead of a unit value.

Let's run a test. The lexer precedes both the interpreter and the compiler.

$ python interpreter.py "-(4+7*2)*6"
Lexer gives ['-', '(', 4, '+', 7, '*', 2, ')', '*', 6]
-108 []


$ python compiler.py "-(4+7*2)*6"
Lexer gives ['-', '(', 4, '+', 7, '*', 2, ')', '*', 6]
(4, 7, 2, '*', '+', '~', 6, '*') []

The bytecode is the expression in postfix format. Postfix is used directly in some languages like Forth and PostScript. In postfix the values precede the operators, so infix 5 + 4 becomes postfix 5 4 +. The order matters for the non-commutative operators: 5 - 4 becomes 5 4 -, with the left argument pushed first.

Executing the Byte Code

Now we need a separate, but simpler, interpreter for the bytecode.

The calculator uses a stack for the computation. Bytecode tokens that are numbers are simply pushed onto the stack. A bytecode token that is an operator pulls its arguments from the stack, does the single computation and pushes the result back onto the stack.

You can download calculator.py.

 1 # calculator.py
 2 #
 3 def calculate(tokens) :
 4     stack = []
 5     for token in tokens :
 6         if token == '$' : break
 7         if type(token) == type(5) : stack.append(token)
 8         elif token == '+' :
 9             b = stack.pop()
10             a = stack.pop()
11             stack.append(a+b)
12         elif token == '-' :    # binary minus
13             b = stack.pop()
14             a = stack.pop()
15             stack.append(a-b)
16         elif token == '~' :    # unary minus
17             a = stack.pop()
18             stack.append(-a)
19         elif token == '*' :
20             b = stack.pop()
21             a = stack.pop()
22             stack.append(a*b)
23         elif token == '/' :
24             b = stack.pop()
25             a = stack.pop()
26             stack.append(a/b)
27         print('Apply token: %-3s %s' % (token, stack))
28     return stack[0]
29

$ python calculator.py "-(4+7*2)*6"
Lexed ['-', '(', 4, '+', 7, '*', 2, ')', '*', 6]
Bytecode (4, 7, 2, '*', '+', '~', 6, '*')
Apply token: 4   [4]
Apply token: 7   [4, 7]
Apply token: 2   [4, 7, 2]
Apply token: *   [4, 14]
Apply token: +   [18]
Apply token: ~   [-18]
Apply token: 6   [-18, 6]
Apply token: *   [-108]
Value -108

Notice the special operator "~" (tilde) at line 16, which is used for the unary minus. A distinct symbol is needed because in postfix a plain "-" would be read as the binary subtraction operator.

In the Real World

This project is only a toy compiler, really just a learning tool to let us peek under the hood. Yet these mechanisms of analyzing syntax and applying semantics (code generation or evaluation) are at the heart of every language.

Most compilers are created with a tool set. The most famous is the original Lex (lexical analyzer generator) together with Yacc (yet another compiler compiler); Flex and Bison are open source versions of the pair. If you are running Unix, you have access to them. To build a compiler for your language you need to supply a grammar in BNF format, along with instructions for what to do when reductions are made (similar to our bytecode generation). The output is a compiler program incorporating both. Yacc and Lex build the compiler as a C program, and so, by default, do Flex and Bison.

For our toy grammar the BNF might look like:

expression ::=  term | term + expression | term - expression
term       ::=  unit | unit * term       | unit / term
unit       ::=  number   | - number | ( expression )  | - ( expression )

The constructed compiler is then run to compile or interpret a program in your new language. The output depends on the code you wrote to handle the reductions. This is most often a meta-code like our postfix above, but much more involved. Recently it has become common to output code for existing virtual machines like the JVM (Java Virtual Machine), making it possible to use existing system libraries. Clojure (Lisp on the JVM) and Scala (a functional language on the JVM) are examples. Elixir is also a functional language; it compiles to BEAM code built for Erlang. BEAM is optimized for concurrency, which is a major feature of both Erlang and Elixir.

Several newer languages such as TypeScript and Elm "compile" to JavaScript to run in the browser. Both of these languages are strongly typed. Elm is close to ML or Haskell and also greatly simplifies callbacks by using signals, a new paradigm. TypeScript introduced the traditional class syntax, which was later incorporated into JavaScript as well.

All the Python code for this project is available in cfsm.zip

If you have comments or suggestions, you can email me.

Copyright © 2017-2021 Chris Meyers

* * *