Text Processing: A Simple Pattern Matcher

The Motivation

In this chapter we will develop a small and fast pattern matcher that is modeled on some real code that was used in our newpaper's wire service programs to detect story headers and trailers. This was in our early typesetting system and as in the spelling checker program it was fairly critical to keep the footprint as small as possible both in terms of memory and demand on the CPU.

The algorithm was written in assembler for the PDP-11 and only about 20 instructions long. Yet it could match multiple patterns in the input simultaneously even when these patterns were intersperced with each other.

When we developed a new wire service program in the mid eighties, we kept the algorithm but converted it to C. This ran nearly as fast. I would like to demonstrate the algorithm in Python first and then in two versions of C, one oriented around arrays and other using pointers. It is interesting to do some timing runs and see how Python stacks up to C and how C stacks up C optimized with pointers. Finally, we can use the Standalone Python Tutor project to model the Python version of the algorithm in a html "slide show".

Actually, lets start with the last item first so we can see the algorithm working independent of the code. We're going to match for patterns stop, pit, and top in an input stream consisting of "stopit top".

When the program starts the patterns are loaded and it is ready to receive input.

Figure 1

The working space "Partial matches" is empty. When a input character matches the start of a pattern and new partial match is created. This contains a reference to the pattern and to the last character matched

Figure 2

Subsequent input characters may keep the partial matches alive as we see below. New partial matches may also be started.

Figure 3

If the next character in the input is not what a partial match needs, it is simply abandoned. If the character completes the pattern a callback function is invoked and the the partial match is abandoned.

Figure 4

To see all 37 slides and play them back and forth, Click Here and use the back and forward buttons or the slider just above them. Click the browser back button to return to this page.

Python Implementation

My hope is that seeing the above visualization will make understanding the implementation of the algorithm and of course the algorithm itself quite straight-forward.

For the Python code without producing the visualization Click here

The class PatMatch set up a matcher that contains a set of patterns to match, a working area for partial matches (up to 50 at a time) and a simple counter to count matches. Each pattern is saved along with a callback function, although in this program only a single function callBack is used to announce the match.

class PatMatch :
   def __init__ (self) :
       self.patterns = []       # To match
       self.working  = [0]*50   # Partial matches
       self.worktop  = 0        # Limit for partials
       self.nMatches = 0        # Number matches found so far

   def addPattern(self, pattern, callBack) :
       # Add a pattern with its call back function
       self.patterns.append( (pattern, len(pattern), callBack) )

   def reset (self) :
       # Cancel all partial matches
       self.worktop  = 0
       self.nMatches = 0

The heart of the algorithm is in the seeChar method. Here an input character ch is processed. If it starts any new pattern then the pattern is loaded into the working area. Once any new patterns are loaded, all are checked to see if the new character is compatible. If it is then the partial match is updated or completed, otherwise the partial match is discarded.

Discarding partial matches is done very efficiently. Partial matches that survive simply overwrite the discards and the work area is downsized accordingly. When a discard is copied over, it is only a pointer to a list that is copied; not the creation of a new list or new string. It's all very very fast.

def seeChar (self,ch) :
    # a new char to be handled
    # if char starts a pattern - add it to top of workspace
    for p in self.patterns :
        if ch == p[0][0] :
            # partial match has pattern pointer and # chars matched
            self.working[self.worktop] = [p, 0]
            self.worktop += 1

    # scan through work area to adjust to char ch
    wp = wk = 0                   #
    while wp < self.worktop :
        # wpat is pattern, wcp chars matched so far
        wpat,wcp = self.working[wp]
        if wpat[0][wcp] == ch :
            # new character continues the match
            wcp += 1
            if wcp >= wpat[1] :
                # Oh. the new char has completed match
                wpat[2](wpat[0])   # Do the call back
                self.nMatches += 1
            else :
                # Still more to match. tuck partial match in
                self.working[wk] = (wpat,wcp)
                wk += 1  # Adjust new worktop, keeping partial
        wp += 1
    self.worktop = wk    # set worktop to matches kept

A test function takes patterns on the command line, input from standard input and outputs to the standard output. We can run the simulation with the following

$ echo "stopit top" | python patMatch.py 1 1 stop top pit
stop<stop><top>it<pit> top<top>
4 matches: 0.000102996826172 secs

The 4 matches are in angle brackets directly after the character that triggered the match.

Testing for Timing

Running patMatch.py uses command line arguments. The 1st is zero or one; one means output input and matches to stdout. The 2nd is also zero or one; one means send the input to the pattern matcher. The remaining arguments are the patterns to be matched.

The file stopit.dat just repeats the "stopit top" we used with the echo command but enough times to get some reliable timing.

The first run below is just to get an idea of the overhead of input the characters alone. This should be subtracted from the second run which also sends the 15,000 characters to the pattern matcher. The latter took about 54 milliseconds and about 88% of that was spent in the pattern matcher.

$ cat stopit.dat | python patMatch.py 0 0 stop top pit
0 matches: 0.00759887695312 secs

$ # send to the pattern matcher but no echo to output
$ cat stopit.dat | python patMatch.py 0 1 stop top pit
6567 matches: 0.0542540550232 secs

Pattern matching in C

I couldn't resist comparing this to a C implementation; actually two. See the code patMatch.c and timer.c. Instruction for building the executable are in a comment at the top of patMatch.c.

In patMatch.c the function match1 is the same algorithm we had in the seeChar method in Python but using typical array operations. Function match2 uses what should be more efficient pointer logic and more closely resembles the old assembler code. Here are the results.

$ # do NOT send input to the pattern matcher
$ cat stopit.dat | ./patMatch 0 0 stop top pit
0 matches - 0.000666 secs

$ cat stopit.dat | ./patMatch 0 1 stop top pit
6567 matches - 0.001645 secs

$ cat stopit.dat | ./patMatch 0 2 stop top pit
6567 matches - 0.001630 secs

First of all, the basic I/O is about 2.5 times faster than Python. The second run using the match1 function is about 1% slower than the one optimized using pointers. Both are much faster than the Python version. About 40 times faster.

A lot of work went into writing the pointer version and such optimization is generally not time well spent. Modern C compilers including gcc can optimize array operations into pointer logic including the use of fast registers during compliation.

Actually, I ran both C versions of the pattern matcher about 100 times each. The times above were the median average results. There is enough background processing going on in my laptop to make the comparison tricky.

SPT Visualization of the Python Algorithm

The visualization at the first part is done with Standalone Python Tutor. To make sence of the rest of this section you may need to the link.

The program patMatchSpt.py is an altered version of patMatch.py with the following additions.

The basic template for each "slide" is defined in the template. Properties of the view instance will be applied to the template with the usual Python % string operations.

The input stream, patterns and partial matches tables are spt matrices

import spt
template = """
  <h2>Visualization of Pattern Matching</h2>
  <hr>Input stream
  <hr>Partial matches

view = spt.View(template)
tblAttr = 'border="1" cellspacing="0" cellpadding="4"'

Slides are produced mostly with from within the method seeChar and look like the following

self.makeSlide("New Pattern(s) added to Workspace")

The new PatMatch method makeSlide creates a single slide

def makeSlide(self, message) :
    view.banner = message
    view.stream  = spt.Matrix(data=[list(self.stream)],
    workMat = spt.Matrix(1,1,tableAttr=tblAttr)
    for wp in range(self.worktop) :
        wpat,wcp = self.working[wp]
        workMat[wp,0] = wpat[0]
        workMat.style[wp,0] = 'background-color:lightgreen'
        for c in range(wcp) :
            workMat[wp,c+1] = wpat[0][c]
    view.working = workMat.renderHtml()

To create the visualization the following command was executed. The spt area lies in an adjacent directory on my computer.

$ echo "stopit top" | python ../spt/sptBuild.py patMatchSpt.py 1 1 stop top pit stop<stop><top>it<pit> top<top> 4 matches: 0.00810098648071 secs

The resulting file (about 1/2 megabyte) patMatchSpt.html can be directly loaded to the browser

Copyright © 2014-2015 Chris Meyers