In the early days of programming, people would write programs in pseudo-code which would then be translated by hand into the computer's numeric machine code. Perhaps an *assembler* program could assemble machine instructions from mnemonics (like ADD for add, or MUL for multiply), along with names for memory locations that the assembler would assign numbers to. For example, if you (the programmer) wanted to compute a formula like

2 + 4 * 6

then a coder (perhaps someone else) might write a snippet like the following.

```
LDA #4   ; load accumulator with number 4
MUL #6   ; multiply by 6 leaving result in the accumulator
ADD #2   ; add two to the product - result in accumulator
```

and the assembler program would translate this into machine code.

Of course, real comments following the ";" would be more sophisticated, since anyone actually able to read this code would know full well what LDA, MUL and ADD mean.

But if you wanted instead

(2 + 4) * 6

then the snippet would look like

```
LDA #4   ; load accumulator with number 4
ADD #2   ; add two
MUL #6   ; multiply by 6 leaving result in the accumulator
```

to force the addition to take place first.
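Python, of course, applies the usual precedence rules itself, so the two readings of the formula can be checked directly:

```python
# Multiplication binds tighter than addition, so the two formulas
# from the assembly snippets above give different results.
print(2 + 4 * 6)    # 26 - the multiplication happens first
print((2 + 4) * 6)  # 36 - parentheses force the addition first
```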

The early language Fortran (for FORmula TRANslation) seemed a little magical by the standards of the time since it could do this (and more) automatically. Formulas (also called expressions) could be nested with parentheses to any depth. Automatically dealing with such expressions is most easily done in a language that supports recursive functions. Early versions of Fortran did not.

Pascal came along a few years later and did support recursion, although a bit awkwardly. Suddenly it became fairly easy to write programs to handle recursive data structures.

Pascal had something else going for it: an interesting technique was used to formally describe its syntax. Boxes connected by labeled arrows represented states; by looking at the next item in the input stream, the parser could follow an arrow to a new box if the arrow's label matched that item. The boxes were formed into groups, each group satisfying a syntactic element. An unlabeled arrow could leave the group successfully: entry from the left, exit from the right. Here is a tiny example for the group *unsigned integer*, which uses another group *digit*.

```
unsigned integer

   -------> [ digit ] ------->
               ^         |
               |  (0-9)  |
               |_________|
```

which means any consecutive string of digits (0-9), with at least one digit, satisfies the *unsigned integer* group. The exit arrow to the right, indicating success, is only taken when no other arrow can be used. Being unable to follow any arrow signifies a *syntax error* in the input.

Ten pages of such "railway boxes" suffice to describe the syntax of the complete Pascal language.

It's not hard to translate such boxes directly into a Python program by simply converting each group into a function. Python, like all modern computer languages, supports recursion just fine, so these boxes may use each other recursively as well.
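For instance, the *unsigned integer* group above might become a function like the following. This is only a sketch to show the idea (the name `unsigned_integer` and the `(position, value)` return convention are mine); the project's real scanning code appears below.

```python
def unsigned_integer(s, p):
    """Traverse the 'unsigned integer' diagram starting at position p.

    Returns (new_position, value) on success, or (p, None) when no
    arrow can be followed at all (i.e. not even one digit is present).
    """
    start = p
    while p < len(s) and s[p] in "0123456789":  # follow the looping (0-9) arrow
        p += 1
    if p == start:                # no digit: unable to follow any arrow
        return p, None
    return p, int(s[start:p])     # unlabeled exit arrow: success

print(unsigned_integer("541+x", 0))  # (3, 541)
```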

Our aim in this project is quite modest. We'll write functions that compute numeric formulas with add, subtract, multiply and divide operations. The program emulates the "railroad boxes" and comes in three slightly different versions.

The first will check if a formula is syntactically valid, the second will act as an *interpreter* to evaluate the formula, and the third will be a *compiler* that translates the formula into something simpler with the same meaning (semantics).

The input formula is a character stream that will first be transformed into a token stream. Tokens are numbers, words and operators.

The token stream is parsed by our program from left to right, looking ahead at just the next token and deciding what to do based on where we are and what that token is.

As blocks are completed, there is a return to the calling block (a *reduction*), which may produce interpretation or translation.

Our first task is to convert an input formula from a character stream into a *token* stream that is more compact and easier to deal with. This is best shown with an example.

```
"541 + (37-2)"                      # character stream (input program)
541 + ( 37 - 2 )                    # tokens separated by spaces
[541, '+', '(', 37, '-', 2, ')']    # tokens as a Python list
```

When reduced to a Python list, integers are evaluated and become single tokens. Operators, keywords and variable names are likewise reduced to individual strings. Although not shown in our *lexer*, extra whitespace and comments would also be discarded.
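As an aside, Python's standard library contains a full lexer for Python source itself, and it produces a very similar token stream. The project uses its own tiny lexer, but the comparison is instructive:

```python
import io
import tokenize

# Tokenize the same sample formula with Python's own lexer.
src = "541 + (37-2)\n"
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(src).readline)
          if tok.type not in (tokenize.NEWLINE, tokenize.ENDMARKER)]
print(tokens)  # ['541', '+', '(', '37', '-', '2', ')']
```

Note that Python's tokenizer keeps numbers as strings; our lexer goes one step further and converts them to integers.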

Here is most of the code from lex.py.

```
01  # lex.py
02
03  from string import ascii_letters, digits, whitespace
04
05  wordStart = ascii_letters + "_"
06  wordInner = wordStart + digits
07  dualOps   = ('**','<=','==','&&','||')
08
09
10  def getDigit(p,prog) :
11      if prog[p] in digits : return (p+1,prog[p])
12      else : return (p,None)
13
14  def getNumber(p,prog) :
15      p,dig = getDigit(p,prog)
16      if dig == None : return (p,None)      # no leading digit
17      number = dig
18      while True :
19          p,dig = getDigit(p,prog)
20          if dig == None : return (p,int(number))
21          number += dig                     # gather ascii digits
22
23  def getWord(p,prog) :
24      if prog[p] in wordStart :             # Python: alphabetic or "_"
25          word = prog[p]
26          p += 1
27          while prog[p] in wordInner :      # alphanumeric
28              word += prog[p]
29              p += 1
30          return (p,word)
31      return (p,None)
32
33  def lexScan(prog) :
34      prog += '$'                           # eof marker
35      tokens = []
36      p = 0
37      while True :
38          while prog[p] in whitespace : p += 1
39          if (prog[p] == '$') : return tokens
40          p,word = getWord(p,prog)
41          if word : tokens.append(word)
42          else :
43              p,number = getNumber(p,prog)
44              if number != None : tokens.append(number)
45              elif prog[p:p+2] in dualOps :
46                  tokens.append(prog[p:p+2])    # "**" "<=" etc
47                  p = p+2
48              else :
49                  tokens.append(prog[p])
50                  p = p+1
51      return tokens
```

Functions *getDigit*, *getNumber* and *getWord* are passed the *program* (a character string) and a pointer *p* showing where to start. If a function is successful, it returns a value along with *p* updated to the next point in the program. If not successful, a function returns *None* and the pointer unchanged. See lines 16 and 20.

Words can be language keywords, or names of variables, constants or functions. These strings may contain letters, digits or the underscore character, but may not begin with a digit.
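The same word rule can be stated as a regular expression. This is just an equivalent formulation for comparison; lex.py does not use regular expressions:

```python
import re

# One letter or underscore, then any mix of letters, digits and underscores.
word = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

print(bool(word.match("radius_2")))  # True
print(bool(word.match("2radius")))   # False - may not begin with a digit
```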

At line 43, if neither a *word* nor a *number* is found, the lexical scanner assumes that the character at the pointer, and perhaps the following character as well, form an operator. The two-character operators are defined in a lookup tuple at line 7.

There is no code here for stripping comments, or for handling any values other than positive integers. Here is an example run.

```
$ python lex.py "2 * pi*radius**2"
2 * pi*radius**2
[2, '*', 'pi', '*', 'radius', '**', 2]
```

Following the model of the Pascal syntax diagrams, we convert diagrams into functions that gather sub-diagrams as needed. If a syntax error occurs our interpreter or compiler will simply stop and point to the current location of the error, much like Python does.

Usually, functions would be processed recursively but I want to take another approach. A class *Calc* contains a method for each diagram and an instance of *Calc* will process tokens to traverse the diagram to its exit. A given method can spawn more *Calc* instances as helpers.

During the computation there is a chain of *Calc* instances that grows and shrinks as the interpretation or compilation of the input tokens proceeds. This is equivalent to recursive functions but perhaps easier to visualize.

The *Calc* class has methods for the following:

- *getExpr* - calc the value of a sum of terms
- *getTerm* - calc the value of a product of units
- *getUnit* - return the value of an integer, a parenthesized expression, or a negated unit

An expression consists of *terms* separated by "+" or "-" operators indicating addition or subtraction. A term consists of *units* separated by the operators "*" or "/" for multiplication or division. A unit, to begin with, is simply an unsigned integer.
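Written out in a common EBNF-like grammar notation, with `*` meaning "zero or more repetitions", the three groups are:

```
expr : term (("+" | "-") term)*
term : unit (("*" | "/") unit)*
unit : INTEGER | "-" unit | "(" expr ")"
```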

So in the expression "45+14*2" the units are "45", "14" and "2". The terms are "45" and "14*2".
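For contrast, the conventional plain-recursive-function approach mentioned earlier might look roughly like this. It is a sketch only (the lowercase names and the `(value, position)` return convention are mine, and `//` stands in for integer division); the project's *Calc*-based version follows.

```python
# Each function takes the token list and a position, and returns
# (value, new_position). A '$' end marker terminates the token list.

def get_expr(tokens, p):
    value, p = get_term(tokens, p)
    while tokens[p] in ("+", "-"):            # gather further terms
        oper = tokens[p]
        right, p = get_term(tokens, p + 1)
        value = value + right if oper == "+" else value - right
    return value, p

def get_term(tokens, p):
    value, p = get_unit(tokens, p)
    while tokens[p] in ("*", "/"):            # gather further units
        oper = tokens[p]
        right, p = get_unit(tokens, p + 1)
        value = value * right if oper == "*" else value // right
    return value, p

def get_unit(tokens, p):
    tok = tokens[p]
    if isinstance(tok, int):                  # a plain number
        return tok, p + 1
    if tok == "-":                            # unary minus
        value, p = get_unit(tokens, p + 1)
        return -value, p
    if tok == "(":                            # parenthesized sub-expression
        value, p = get_expr(tokens, p + 1)
        assert tokens[p] == ")", "missing closing paren"
        return value, p + 1
    raise SyntaxError("bad token %r" % (tok,))

print(get_expr([45, "+", 14, "*", 2, "$"], 0)[0])  # 73
```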

```
01  # syntax.py
02
03  import sys, main
04
05  typeInt = type(5)
06
07  class Calc :
08      def __init__(self, pc, tokens) :
09          self.pc     = pc
10          self.tokens = tokens
11          self.value  = None
12          self.code   = []
13
```

Attributes for a *Calc* instance are the following. *tokens* is the output from lexical scanner, basically the tokenized input program. *pc* is the program counter and points to the next token for this instance to process. *pc* is updated as it consumes tokens. The attributes *value* and *code* will be discussed later.

```
14      def getExpr(self) :
15          calc = Calc(self.pc, self.tokens).getTerm()
16          if calc == None : return self.fatal("getExpr","no first term")
17          self.pc = calc.pc
18          while True :
19              oper = self.getSpec(("+","-"))
20              if not oper : return self
21              calc = Calc(self.pc, self.tokens).getTerm()
22              if calc == None : return self.fatal("getExpr","no expected term")
23              self.pc = calc.pc
24
```

The method *getExpr* gathers terms separated by *+* and *-*. To succeed there must be at least one term (checked at line 16). More terms may be found in the *while* loop. Notice how terms are handled by sub-instances of *Calc* in lines 15 and 21: a new *Calc* instance is spawned and immediately invoked to find a *term* in the input. If it's successful, the new instance itself is returned; otherwise *None* is returned. On success the caller's *pc* is updated. The new instance is discarded when *getExpr* returns.

```
25      def getTerm(self) :
26          calc = Calc(self.pc, self.tokens).getUnit()
27          if calc == None : return self.fatal("getTerm","no first unit")
28          self.pc = calc.pc
29          while True :
30              oper = self.getSpec(("*","/"))
31              if not oper : return self
32              calc = Calc(self.pc, self.tokens).getUnit()
33              if calc == None : return self.fatal("getTerm","no expected unit")
34              self.pc = calc.pc
35
```

The method *getTerm* is almost a copy of *getExpr*, except that the operators are now multiply and divide and *getTerm* looks for units instead of terms. With the code structured like this, the multiplies and divides are done before the adds and subtracts, honoring the expected precedence.

We're going to look at *getUnit* in chunks. It's the most interesting method.

```
36      def getUnit(self) :
37          tok = self.tokens[self.pc]
38          if type(tok) == typeInt :
39              self.pc += 1
40              return self
```

Basic units are numbers, constants or variables. In today's code only unsigned numbers are used. A number is always a single token, already converted by lex.py (line 44).

```
41          elif self.getSpec(['-']) :
42              calc = Calc(self.pc, self.tokens).getUnit()
43              if calc == None : return self.fatal("getUnit","Invalid expr")
44              self.pc = calc.pc
45              return self
```

However, there are two more possibilities for what a unit can be. In line 41 a minus sign indicates a unary minus; the real unit follows, and its value will be negated. This also makes numbers with a leading minus work the way they should.

```
46          elif self.getSpec(['(']) :
47              calc = Calc(self.pc, self.tokens).getExpr()
48              if calc == None : return self.fatal("getUnit","Invalid sub-expr")
49              self.pc = calc.pc
50              if self.getSpec([')']) :
51                  return self
52              else :
53                  return self.fatal("getUnit","no closing paran")
54          else :
55              return self.fatal("getUnit","bad token")
```

Finally, and most interesting of all, is an expression enclosed in parentheses. This, indeed, is itself a unit. The inner expression calculates a numeric value which is then used just as a number would be. And of course an expression in parentheses may contain further sub-expressions. This becomes a recursive calculation, but since the pieces are handled in separate *Calc* instances, it doesn't feel like recursion.

There are also two methods that don't spawn additional instances. *getSpec* does the lookahead, checking whether the next token is among the valid possibilities passed in. If it is, the token is returned and the *pc* advanced. If not, *None* is returned.

```
57      def getSpec(self, valid) :
58          spec = self.tokens[self.pc]
59          if spec in valid :
60              self.pc += 1
61              return spec
62          else : return None
63
```

The method *fatal* backs out of the program. This happens when there is a syntax error in the input. Each nested *Calc* instance prints a program snippet to the right of its *pc*, and in turn fails for its caller.

```
64      def fatal(self, who, mesg="") :
65          words = map(lambda x: str(x), self.tokens[self.pc:])
66          code  = " ".join(words[:10])
67          print "Syntax error:", who, mesg, "-->", code
68          return None
69
70  if __name__ == "__main__" : main.main(Calc)
```

The *main* function is in a separate module and is used by all three programs: the syntax checker, the interpreter and the compiler. A bit upside-down as coding goes, but it was fun to do. The reasons will become clear in a bit.

```
01  import lex, calculate, sys
02
03  def main(Calc) :
04      prog   = " ".join(sys.argv[1:])
05      tokens = lex.lexScan(prog) + ['$']    # Add eof marker
06      print "Input tokens are:", tokens
07      calc = Calc(0, tokens).getExpr()
08      if not calc : print "Compilation failed"
09      else :
10          print "Input left over:", tokens[calc.pc:]
11
12          if calc.value != None :
13              print "Interpreter evaluates:", calc.value
14          elif calc.code :
15              print
16              value = calculate.calculate(calc.code)
17              print "Compiled code evaluates:", value
18          else :
19              print "Syntax check passed"
```

The input is gathered from the command line for easy testing. Tokens are extracted with the *lexScan* function and printed. The initial *Calc* instance is passed the input with *pc* set to 0 (the start). A special end-of-file code *$* lets us see whether all the input was used up in syntax checking the expression.

Here are a couple of tests

```
$ python syntax.py "(5+4)*-3"
Input tokens are: ['(', 5, '+', 4, ')', '*', '-', 3, '$']
Input left over: ['$']
Syntax check passed

$ python syntax.py "(5+4b)*-3"
Input tokens are: ['(', 5, '+', 4, 'b', ')', '*', '-', 3, '$']
Syntax error: getUnit no closing paran --> b ) * - 3 $
Syntax error: getTerm no first unit --> ( 5 + 4 b ) * - 3 $
Syntax error: getExpr no first term --> ( 5 + 4 b ) * - 3 $
Compilation failed
```

In the second example a syntax error was deliberately inserted to show the cascading failure. Here is the code for syntax.py.

We can modify the *Calc* class to compute the value of our input expression and print it to the screen. This can be done while the syntax is being scanned. Here are the changes to the *getUnit* method.

```
42      def getUnit(self) :
43          tok = self.tokens[self.pc]
44          if type(tok) == typeInt :
45              self.value = tok                       ### New. set to integer
46              self.pc += 1
47              return self
48          elif self.getSpec(['-']) :
49              calc = Calc(self.pc, self.tokens).getUnit()
50              if calc == None : return self.fatal("getUnit","Invalid expr")
51              self.value = -calc.value               ### New
52              self.pc = calc.pc
53              return self
54          elif self.getSpec(['(']) :
55              calc = Calc(self.pc, self.tokens).getExpr()
56              if calc == None : return self.fatal("getUnit","Invalid sub-expr")
57              self.value = calc.value                ### New
58              self.pc = calc.pc
59              if self.getSpec([')']) :
60                  return self
61              else :
62                  return self.fatal("getUnit","no closing paran")
63          else :
64              return self.fatal("getUnit","bad token")
```

In lines 45, 51 and 57 you can see *self.value* being set. When a sub-calculation returns, the caller incorporates its *value* attribute into the larger whole.

Now here are the changes to the method *getExpr*

```
14      def getExpr(self) :
15          calc = Calc(self.pc, self.tokens).getTerm()
16          if calc == None : return self.fatal("getExpr","no first term")
17          self.value = calc.value                    ### New
18          self.pc = calc.pc
19          while True :
20              oper = self.getSpec(("+","-"))
21              if not oper : return self
22              calc = Calc(self.pc, self.tokens).getTerm()
23              if calc == None : return self.fatal("getExpr","no expected term")
24              if oper == "+" : self.value += calc.value   ### New
25              if oper == "-" : self.value -= calc.value   ### New
26              self.pc = calc.pc
```

And the same changes are made to the method *getTerm*. That's it.
Here is the code for interpret.py.

And here is a sample run.

```
$ python interpret.py "(5+4)*-3"
Input tokens are: ['(', 5, '+', 4, ')', '*', '-', 3, '$']
Input left over: ['$']
Interpreter evaluates: -27
```

With a few more modifications we can turn interpret.py into compile.py. Instead of evaluating an expression, the compiler will generate "byte code" that can do the evaluation much more efficiently. This is very common: Python itself first compiles the source code in a *.py* file into byte code in a *.pyc* file.
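You can peek at the byte code Python itself generates with the standard *dis* module. (Variables are used in the formula here because Python folds constant expressions at compile time, which would hide the arithmetic.)

```python
import dis

# Compile a formula and disassemble it. The exact opcode names vary a
# little between Python versions, but the stack discipline is the same
# postfix idea our compiler will use.
dis.dis(compile("(a + b) * -c", "<formula>", "eval"))
```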

Here is the compiler program with the changes arising from code generation. Methods *getSpec* and *fatal* are unchanged and not shown.

```
01  # compile.py
02
03  import sys, main
04
05  typeInt = type(5)
06
07  class Calc :
08      def __init__(self, pc, tokens) :
09          self.pc     = pc
10          self.tokens = tokens
11          self.value  = None
12          self.code   = []
13
14      def getExpr(self) :
15          calc = Calc(self.pc, self.tokens).getTerm()
16          if calc == None : return self.fatal("getExpr","no first term")
17          self.code = calc.code
18          self.pc   = calc.pc
19          while True :
20              oper = self.getSpec(("+","-"))
21              if not oper : return self
22              calc = Calc(self.pc, self.tokens).getTerm()
23              if calc == None : return self.fatal("getExpr","no expected term")
24              self.code += calc.code
25              self.code.append(oper)
26              self.pc = calc.pc
27
28      def getTerm(self) :
29          calc = Calc(self.pc, self.tokens).getUnit()
30          if calc == None : return self.fatal("getTerm","no first unit")
31          self.code = calc.code
32          self.pc   = calc.pc
33          while True :
34              oper = self.getSpec(('*','/'))
35              if not oper : return self
36              calc = Calc(self.pc, self.tokens).getUnit()
37              if calc == None : return self.fatal("getTerm","no expected unit")
38              self.code += calc.code
39              self.code.append(oper)
40              self.pc = calc.pc
41
42      def getUnit(self) :
43          tok = self.tokens[self.pc]
44          if type(tok) == typeInt :
45              self.pc += 1
46              self.code.append(tok)
47              return self
48          elif self.getSpec(['-']) :
49              calc = Calc(self.pc, self.tokens).getUnit()
50              if calc == None : return self.fatal("getUnit","Invalid expr")
51              self.code += calc.code
52              self.code.append('neg')
53              self.pc = calc.pc
54              return self
55          elif self.getSpec(['(']) :
56              calc = Calc(self.pc, self.tokens).getExpr()
57              if calc == None : return self.fatal("getUnit","Invalid sub-expr")
58              self.code += calc.code
59              self.pc = calc.pc
60              if self.getSpec([')']) :
61                  return self
62              else :
63                  return self.fatal("getUnit","no closing paran")
64          else :
65              return self.fatal("getUnit","syntax error")
```

Each *Calc* instance has a list *code* which will contain the byte-code commands to perform its bit of the computation. When a sub-calculation returns, its code is appended to whatever is already there; for the first sub-calculation it can simply be copied. When you see *self.code = calc.code* or *self.code += calc.code*, that's what is happening. Operator codes are appended to *self.code* as appropriate.

Let's run a couple of tests.

```
$ python interpret.py "-(4+7*2)*6"
Input tokens are: ['-', '(', 4, '+', 7, '*', 2, ')', '*', 6, '$']
Input left over: ['$']
Interpreter evaluates: -108

$ python compile.py "-(4+7*2)*6"
Input tokens are: ['-', '(', 4, '+', 7, '*', 2, ')', '*', 6, '$']
Input left over: ['$']

Byte-code: [4, 7, 2, '*', '+', 'neg', 6, '*']
Apply token: 4   [4]
Apply token: 7   [4, 7]
Apply token: 2   [4, 7, 2]
Apply token: *   [4, 14]
Apply token: +   [18]
Apply token: neg [-18]
Apply token: 6   [-18, 6]
Apply token: *   [-108]
Compiled code evaluates: -108
```

The byte-code is a program in postfix format, much like the languages Forth and PostScript. Values are pushed onto a stack, and operators (like + - * / neg) pull value(s) from the stack, do the operation, and push the result back onto the stack. In the "Apply" lines you can see each token being applied and the state of the stack afterwards.

At the end of the calculation the stack should hold just one element, the answer.

But now we need a separate, though simpler, interpreter for the byte-code.
Here is the code for calculate.py. Tokens that are numbers are simply pushed onto the stack. Operators pull their arguments from the stack, do their operation and push back the result. Notice the special operator *neg* which handles the unary minus.

```
01  def calculate(tokens) :
02      stack = []
03      print "Byte-code:", tokens
04
05      for token in tokens :
06          if token == '$' : break
07          if type(token) == type(5) : stack.append(token)
08          elif token == '+' :
09              b = stack.pop()
10              a = stack.pop()
11              stack.append(a+b)
12          elif token == '-' :
13              b = stack.pop()
14              a = stack.pop()
15              stack.append(a-b)
16          elif token == 'neg' :
17              a = stack.pop()
18              stack.append(-a)
19          elif token == '*' :
20              b = stack.pop()
21              a = stack.pop()
22              stack.append(a*b)
23          elif token == '/' :
24              b = stack.pop()
25              a = stack.pop()
26              stack.append(a/b)
27          print 'Apply token: %-3s'%token, stack
28      return stack[0]
```
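For what it's worth, the same stack machine can be condensed in modern Python with a dispatch table of operator functions. This is a behavior-equivalent sketch, not part of the project code; `operator.floordiv` mirrors the integer division the original performs.

```python
import operator

# Map each binary byte-code operator to a function.
OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.floordiv}  # integer division

def calculate(tokens):
    stack = []
    for token in tokens:
        if token == '$':
            break
        if isinstance(token, int):
            stack.append(token)              # numbers are pushed
        elif token == 'neg':
            stack.append(-stack.pop())       # unary minus pops one value
        else:
            b, a = stack.pop(), stack.pop()  # binary operators pop two
            stack.append(OPS[token](a, b))
    return stack[0]

print(calculate([4, 7, 2, '*', '+', 'neg', 6, '*']))  # -108
```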

This project is only a toy compiler, really just a learning tool to have a peek under the hood. But these mechanisms of searching for syntax and applying semantics (code generation or evaluation) are at the heart of every compiler.

Most compilers are created with a tool set. The most famous is the original Lex (lexical analysis) together with Yacc (Yet Another Compiler Compiler); Flex and Bison are open-source versions of the same pair. If you are running Unix, you have access to them. To build a compiler for your language you supply a grammar (similar to the railway diagrams) and instructions for what to do when reductions are made (similar to our generating *self.code*). The output is a compiler program incorporating both. Like Lex and Yacc, Flex and Bison build the compiler as a C program.

The constructed compiler is then run on a program in your new language, and the output depends on the code you wrote to handle the reductions. This is most often a meta-code like our postfix above, but much more thorough. Recently, it has become common to output code for existing virtual machines like the JVM (Java Virtual Machine), making it possible to use existing system libraries. Clojure (Lisp on the JVM) and Scala (a functional language) are examples. Elixir is also a functional language; it compiles to BEAM code built for Erlang, and BEAM is optimized for concurrency, a major feature of both Erlang and Elixir. Even the new language Elm compiles to JavaScript to get full access to your browser.

All the Python code for this project is available in cfsm.zip

Copyright © 2017 Chris Meyers