Text Processing: A Simple Spelling Checker


The Problem

In this project we will develop a small and extremely fast spelling checker that is completely memory resident and has a surprisingly small memory footprint.

The motivation for such a program goes back to the mid-70s when I was working at a newspaper as a young computer programmer. The lack of any form of automatic spell checking meant that human proofreaders needed lots of time to carefully find misspelled words and correct them. We wanted a program that could simply flag any word that was not in a lexicon (ours had over 50,000 words).

This is fairly trivial with today's computer technology but was a challenge for the computers of that time. We had a file, spell.words, of 53,751 common words that looks like the following.

a
aal
aalii
aaliis
aals
.
.
.
zwieback
zygal
zygon
zygosity
zygotene
zygotic
zymaturgy
zymes
zymic
zymin
.

We can get this into a list in memory with the following

>>> words = open("spell.words").readlines()
>>> words = [w.strip() for w in words]

The strip method is needed to remove the newline character at the end of each word.

Now we can simply check if a word is in our list

>>> 'zygotic' in words
True

Checking each word of a 1000-word story this way takes some time.

>>> for i in range(1000) : present = 'zygotic' in words
...
>>>

And even on my laptop in 2014 there is about a two-second pause. Of course, that is the worst case, since 'zygotic' sits near the end of the list and the linear search must scan almost all of it. And what takes nanoseconds today took microseconds back then.

Today, the timing is really not a problem. In Python we have dictionaries built into the language. So we can simply do the following.

>>> hash = {}
>>> for word in words : hash[word] = True
...
>>>

Now with even 100,000 lookups, the prompt returns almost immediately.

>>> for i in range(100000) : present = 'zygotic' in hash

But in the mid-70s this kind of computing power was only a dream. In 40 years we have had about a thousand-fold increase in CPU speed, main memory and disk capacity. Yesterday's megabyte is today's gigabyte; yesterday's CPU cycles were measured in microseconds, today's in nanoseconds. Our server computer, where the spell checking program would run (along with about 20 other programs), was a PDP-11/70 that had only 1.5 megabytes of main memory and precious few CPU cycles to spare.

What we did is similar to having a hash table. A hash function can generate a pseudo-random address for any word in our lexicon, and this can be done inexpensively in terms of time. Since we need only True and False values, we can replace them with single bits, 1 and 0.

So consider a bitmap with a million bits. If each of the 53751 words in our lexicon is used to generate a hash between zero and 999,999 then we can set the corresponding bit in the bitmap to one. And later, to check if some word is in the lexicon we simply generate its hash and see if that bit is set. If it is, then we assume the word is correctly spelled. If not, then it's assumed to be misspelled.

The second assumption is fine. If the bit is not set then the word is clearly not in our lexicon (our definition of misspelled). But it can happen that a truly misspelled word still generates a hash already set by another word (or words) from the lexicon. This results in a "false positive" and cannot be eliminated. But it can be minimized.

If each of the 53751 words generated a unique hash then we would set exactly that many bits. But because of the potential for overlap, slightly fewer bits are actually set. The chances of getting a false positive from a misspelled word are a bit less than one in twenty in this configuration.
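
A quick calculation shows where that one-in-twenty figure comes from (it ignores overlap, which makes the true fraction slightly smaller):

>>> 53751./2**20          # fraction of the bits set, one hash per word
 0.0512609               #        about 1 chance in 20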

That's not quite good enough.

One way to improve things might be to simply double the size of the bitmap. Then only about 1 bit in 40 would be set and our rate of false positives would drop by half. Of course, our hash function needs to span the new size.

Even better, and much more interesting, would be to make a second bitmap with a different hash function and let each word in the lexicon set one bit in both. To check a test word we see if both bits are set. Words not in the lexicon now have only a 1/400 (20*20) chance of being a false positive.

But now we've used twice the memory and the result is still only so-so. An interesting compromise is to use just a single bitmap but to still set bits from 2 independent hash functions. Then about 1 bit in 10 will be set and the error rate will be about 1/100 (10*10). Well, that's still much better than 1/20.
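
The same back-of-the-envelope arithmetic (again ignoring overlap) gives that 1/100 estimate:

>>> (2*53751./2**20)**2   # 2 functions setting bits in one bitmap
 0.0105107               #        expected error rate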

We can continue this with even more hash functions. With each one the bit density increases (a bad thing) but the false positive rate continues to fall. Here are some results for using 4 to 7 hash functions.

>>> (194320./2**20)**4    # 4 functions setting bits
 0.0011794               #        expected error rate
>>> (236966./2**20)**5    # 5 functions setting bits
 0.0005894
>>> (277378./2**20)**6    # 6 functions setting bits
 0.0003426
>>> (277565./2**20)**7    # 7 functions setting bits
 0.0000911

This is actually done with a bitmap size of 1048576 (2**20). The numbers 194320 to 277565 are the numbers of bits actually set by our lexicon. Dividing by the total number of bits gives a base fraction that is then raised to the appropriate power to produce the expected rate of false positives. For 4 functions this is a bit over 1/1000. By 7 it is about 1/10000. For reasons that will become clear shortly, we'll use 6 functions.
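
For the curious, the standard Bloom-filter estimate (not part of the original article) predicts these rates directly from the lexicon size n, the bitmap size m, and the number of hash functions k. A small sketch:

from math import exp

def expectedRate(n, m, k):
    # Expected fraction of bits set after inserting n lexicon words with
    # k hash functions into an m-bit map, raised to the k-th power to give
    # the chance that a random non-lexicon word finds all k of its bits set.
    density = 1.0 - exp(-float(k) * n / m)
    return density ** k

print(expectedRate(53751, 2**20, 6))   # roughly 0.00034, close to the 0.0003426 above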

A Bitmap class

Our module bitmap.py defines a class Bitmap that can be used as follows.

>>> import bitmap
>>> b = bitmap.Bitmap(64)
>>> b.map
[0, 0, 0, 0]

On our PDP-11 the bitmap was an array of integers, each with 16 bits. We'll simulate that in Python.

Here, 4 integers give us a bitmap of 64 bits. We can watch the words in b.map change as bits are set:

>>> b.setBit(0)
>>> b.map
[1, 0, 0, 0]
>>> b.setBit(1)
>>> b.map
[3, 0, 0, 0]
>>> b.setBit(63)
>>> b.map
[3, 0, 0, 32768]
>>> b.getBit(63)
True
>>> b.getBit(62)
False
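
The bitmap.py module itself is not listed here, but a minimal sketch that reproduces the behavior above (one integer per 16 bits, mimicking the PDP-11 word size) might look like this:

class Bitmap:
    def __init__(self, nbits):
        # nbits is assumed to be a multiple of 16
        self.map = [0] * (nbits // 16)

    def setBit(self, n):
        word, bit = divmod(n, 16)
        self.map[word] |= (1 << bit)

    def getBit(self, n):
        word, bit = divmod(n, 16)
        return ((self.map[word] >> bit) & 1) == 1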

For our spell checker we get our million-bit (actually 1048576-bit) bitmap as follows

>>> import bitmap
>>> b = bitmap.Bitmap(2**20)
>>> b
<bitmap.Bitmap instance at 0xb724decc>
>>> len(b.map)
65536

The last call to len shows that we are using 65536 (64K) 16-bit words, or 128K bytes, to hold our bitmap of 1048576 bits. That was roughly 8 percent of the 1.5 megabytes of main memory in our PDP-11/70.

Getting our Hash Functions

We had studied possible hash functions fairly carefully to make sure that each would give even results across the bitmap and that the functions were independent of each other. For this simulation I'm going to simply use the hashlib module's md5 function to generate a nice 32-digit hexadecimal number (in string form) from any input string. I'm pretty sure it's just fine, especially for this demonstration.

>>> import hashlib
>>> hashlib.md5('zygotic').hexdigest()
'2368434ae5c0612e153f3a37283767bf'

It takes 5 hexadecimal digits to map 2**20 bits (why?). If we slice the first 30 digits into 6 five-digit chunks then we have 6 independent hash numbers to use.

>>> b = '2368434ae5c0612e153f3a37283767bf'
>>> int(b[0:5],16)       # Convert string to integer as hex (base 16)
145028
>>> int(b[5:10],16)
215781
>>> int(b[10:15],16)
787986
>>> int(b[15:20],16)
922943
>>> int(b[20:25],16)
238450
>>> int(b[25:30],16)
538471

You can see that the bits that are to be set for "zygotic" are all in the appropriate range.
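
Packaged as a small helper (the name getHashes and its exact form are my own invention; the original program may organize this step differently):

import hashlib

def getHashes(word, nhashes=6):
    # md5 gives 32 hex digits; each 5-digit slice is an index in 0 .. 2**20 - 1
    digest = hashlib.md5(word.encode()).hexdigest()
    return [int(digest[i*5 : i*5+5], 16) for i in range(nhashes)]

For 'zygotic' this returns exactly the six numbers computed above.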

A Simple Spell Checker

Our program spellCheck.py pulls all of this together. It first creates and populates a bitmap with all 53751 words. There is a slight pause for that. Then it reads text from standard input. The text is first stripped of punctuation and set to lowercase. Then each word is spell checked and written to standard output with the result. The logic of the program should be straightforward.
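
The program itself is not reproduced here, but a minimal sketch of the same logic, built on the Bitmap class and a getHashes helper like the one above (the punctuation handling shown is my guess at the details), might look like this:

import sys
import hashlib
import bitmap

def getHashes(word, nhashes=6):
    # same slicing of the md5 digest as sketched earlier
    digest = hashlib.md5(word.encode()).hexdigest()
    return [int(digest[i*5 : i*5+5], 16) for i in range(nhashes)]

def loadLexicon(path, bmap):
    # set the bits for every word in the lexicon file
    for line in open(path):
        for h in getHashes(line.strip()):
            bmap.setBit(h)

def check(word, bmap):
    # a word passes only if every one of its bits is set
    return all(bmap.getBit(h) for h in getHashes(word))

def main():
    bmap = bitmap.Bitmap(2**20)
    loadLexicon(sys.argv[1], bmap)
    for line in sys.stdin:
        for word in line.lower().split():
            word = word.strip('.,;:!?"()')   # trim surrounding punctuation
            if word:
                print("%-6s %s" % (check(word, bmap), word))

if __name__ == "__main__":
    main()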

This is our test file

$ cat test.text
Now is the time for the zygotic zwieback to be eeaten.
Don't do it in an air-conditioned settting, ok?

$ python spellCheck.py spell.words <test.text
True   now
True   is
True   the
True   time
True   for
True   the
True   zygotic
True   zwieback
True   to
True   be
False  eeaten
True   don't
True   do
True   it
True   in
True   an
True   air-conditioned
False  settting
True   ok

Comments

In our original implementation we used only 4 hash functions and only 32768 words of memory. The two-second pause that accompanies populating the bitmap with the lexicon above was about a minute back then. That is not a thousand times slower, because the original was written in C and assembler, which gave it roughly a 30-to-1 speed advantage over Python. Also, I am not sure how expensive (in time) the md5 function is.

The original program worked as a server program (at the time we called it a "spooler"). The bitmap with the lexicon loaded resided as a file on the disk and could be loaded in a second or two. But it was difficult to change the lexicon on the fly. Deleting a mistakenly added word meant rebuilding the entire bitmap. (why?)

For its time, the little program was a dream come true and it spread from our newspaper to about 30 others using the same computer system for typesetting. But within 10-15 years much better spell checking was available on other systems built on more powerful hardware. These could offer suggestions for corrections. Ours could not. (why?)

In the near future I want to do a chapter on the Soundex phonetic algorithm with, of course, a small Python implementation. That can provide a rudimentary way to offer possible corrections to misspelled words.

An alert reader in Portugal wrote to explain that this idea of overlaying bitmaps had been formalized before we used it. It is called a Bloom Filter after Burton Howard Bloom, who conceived of it in 1970. You can learn more about this at Bloom Filter.

One last thing. The lexicon is quite old. The last update might have been around 1990. Some words are missing. Especially obvious are email and internet.

All files including the lexicon are in spellCheck.zip

If you have comments or suggestions, you can email me.

Copyright © 2014-2021 Chris Meyers and Fred Obermann

* * *