Text Processing: A Simple Spelling Checker

The Problem

In this chapter we will develop a small and extremely fast spelling checker that is completely memory resident and has a surprisingly small memory footprint.

The motivation for such a program goes back to the mid-70s, when I was working as a computer programmer at a newspaper. The lack of any form of automatic spell checking meant that human proofreaders spent huge amounts of time carefully finding and correcting misspelled words. We wanted a program that could simply flag any word that was not in a lexicon (ours had over 50,000 words).

This is fairly trivial with today's computer technology. We have a file spell.words of 53,751 common words that looks like the following.

a
aal
aalii
aaliis
aals
.
.
.
zwieback
zygal
zygon
zygosity
zygotene
zygotic
zymaturgy
zymes
zymic
zymin

We can get this into a list in memory with the following

>>> words = open("spell.words").readlines()
>>> words = [word.strip() for word in words]

The strip method is needed to remove the newline character at the end of each word.

Now we can simply check if a word is in our list

>>> 'zygotic' in words
True

But if we were checking each word of a 1000-word story, the time required would be similar to this loop:

>>> for i in range(1000) : a = 'zygotic' in words
...
>>>

Even on my laptop in 2014 there is about a two-second pause.
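
To see the cost concretely, here is a sketch using the standard timeit module. The synthetic word list is a stand-in (in case you don't have spell.words handy); the exact timings will vary, but the gap will not.

```python
import timeit

# Build a synthetic lexicon of 50,000 "words" so the sketch runs
# without the spell.words file.
words = ["w%d" % i for i in range(50000)]
lookup = words[-1]          # worst case: last element of the list

# Linear search: 'in' on a list scans every element.
t_list = timeit.timeit(lambda: lookup in words, number=1000)

# Hash lookup: 'in' on a set (or dict) is roughly constant time.
word_set = set(words)
t_set = timeit.timeit(lambda: lookup in word_set, number=1000)

print(t_list, t_set)        # the list version is vastly slower
```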

But that's really not a problem today. In Python we have dictionaries built into the language. So we can simply do the following.

>>> hash = {}
>>> for word in words : hash[word] = True
...
>>>

Now, even with 100 times the number of lookups above, the prompt returns almost immediately.

>>> for i in range(100000) : a = 'zygotic' in hash
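
As an aside, in current Python a set is the idiomatic container when only membership matters; this sketch shows it behaving just like the dictionary of True values above. The short word list here is a stand-in for the full lexicon.

```python
# A set gives the same constant-time membership test as a dict
# mapping each word to True, without storing dummy values.
words = ["a", "aal", "zygotic", "zwieback"]   # stand-in for the full lexicon
lexicon = set(words)

print('zygotic' in lexicon)   # True
print('eeaten' in lexicon)    # False
```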

Alas, in the mid-70s this kind of computing power was only a dream. In 40 years we have seen about a thousand-fold increase in CPU speed, main memory, and disk capacity. Yesterday's megabyte is today's gigabyte. CPU cycles were then measured in microseconds; today, in nanoseconds. Our server computer, where the spell checking program would run (along with about 20 other programs), was a PDP-11/70 with only 1.5 megabytes of main memory and precious few CPU cycles to spare.

What we did was something similar to the hash table. A hash function can generate a pseudo-random address that is repeatable for a given key. Each word in the lexicon can serve as a key and be converted to an address, and this can be done inexpensively in terms of time. And since we need only True and False values, we can replace them with single bits, 1 and 0.

So consider a bitmap with a million bits. If each of the 53751 words in our lexicon is used to generate a hash between zero and 999,999 then we can set the corresponding bit in the bitmap to one. And later, to check if some word is in the lexicon we simply generate its hash and see if that bit is set. If it is, then we assume the word is correctly spelled. If not, then a misspelling is assumed.

The second assumption is safe. If the bit is not set, then the word is definitely not in our lexicon (our working definition of misspelled). But it can happen that a truly misspelled word generates a hash whose bit was already set by one (or even more) words from the lexicon. This results in a "false positive" and cannot be eliminated. But it can be minimized.

If each of the 53,751 words generated a unique hash we would set exactly that many bits; because of collisions, somewhat fewer bits are actually set. It turns out that the chance of a misspelled word producing a false positive is a bit less than one in twenty in this configuration.

That's not quite good enough.

One way to improve things would be simply to double the size of the bitmap. Then only about one bit in forty would be set and our rate of false positives would drop by half. Of course, our hash function would need to span the new size.

Even better, and much more interesting, would be to make a second bitmap with a different hash function and let each word in the lexicon set one bit in both. To check a test word we see if both bits are set. Words not in the lexicon now have only a 1/400 (20*20) chance of being a false positive.

But now we've used twice the memory and the result is still only so-so. An interesting compromise is to use just a single bitmap but still set bits with two independent hash functions. Then about one bit in ten will be set and the error rate will be about 1/100 (10*10). Well, that's still better than 1/20.

We can continue this with even more functions. With each additional function the bit density increases (a bad thing) but the false positive rate continues to fall. Here are some results for using 4 to 7 hash functions.

>>> (194320./2**20)**4    # 4 functions setting bits
  0.0011794               #        expected error rate
>>> (236966./2**20)**5    # 5 functions setting bits
  0.0005894
>>> (277378./2**20)**6    # 6 functions setting bits
  0.0003426
>>> (316146./2**20)**7    # 7 functions setting bits
  0.0002265

This is actually done with a bitmap size of 1048576 (2**20). The numbers 194320 to 277378 are the numbers of bits actually set by our lexicon; the figure for 7 functions, about 316146, is estimated from the collision formula, since each additional function necessarily sets more bits. Dividing by the total number of bits gives the fraction of bits set, which is then raised to the appropriate power to produce the expected rate of false positives. For 4 functions this is a bit over 1/1000; by 7 it is about 1/4400. For reasons that will become clear shortly, we'll use 6 functions.
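
The arithmetic above can be packaged as a small function. This kind of structure is known today as a Bloom filter, and its standard estimate goes like this: with n words, m bits, and k hash functions, the expected fraction of set bits is 1 - (1 - 1/m)**(k*n), and a false positive must find all k of its bits already set. The sketch below computes that estimate; it approximates, rather than reproduces, the measured figures above.

```python
def false_positive_rate(n_words, m_bits, k_hashes):
    """Estimate the false-positive rate of a bitmap filter.

    n_words  -- number of lexicon words inserted
    m_bits   -- size of the bitmap in bits
    k_hashes -- number of independent hash functions per word
    """
    # Expected fraction of bits set after inserting all the words.
    frac_set = 1.0 - (1.0 - 1.0 / m_bits) ** (k_hashes * n_words)
    # A false positive must find all k of its bits already set.
    return frac_set ** k_hashes

for k in range(4, 8):
    # k=6 gives roughly 3.4e-4, close to the measured 0.0003426.
    print(k, false_positive_rate(53751, 2**20, k))
```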

A Bitmap class

Our module bitmap.py defines a class Bitmap that can be used as follows.

>>> import bitmap
>>> b = bitmap.Bitmap(64)
>>> b.map
[0, 0, 0, 0]

The simulated bitmap is a list of integers, each with 16 bits. So 4 integers give us a bitmap of 64 bits. Watching b.map change is peeking under the hood. We can set bits as follows.

>>> b.setBit(0)
>>> b.map
[1, 0, 0, 0]
>>> b.setBit(1)
>>> b.map
[3, 0, 0, 0]
>>> b.setBit(63)
>>> b.map
[3, 0, 0, 32768]
>>> b.getBit(63)
True
>>> b.getBit(62)
False
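
The bitmap.py module itself is not listed in this chapter, but here is a minimal sketch of an implementation consistent with the behavior shown above (the real module may differ in detail; the 16-bit word size matches the PDP-11 era):

```python
class Bitmap:
    """A bitmap simulated as a list of 16-bit integers."""

    def __init__(self, nbits):
        # Each list element holds 16 bits, so we need nbits/16 of them.
        self.map = [0] * (nbits // 16)

    def setBit(self, n):
        # Locate the 16-bit word and the bit within it, then OR it in.
        word, bit = n // 16, n % 16
        self.map[word] |= 1 << bit

    def getBit(self, n):
        # True if bit n is set, False otherwise.
        word, bit = n // 16, n % 16
        return bool(self.map[word] & (1 << bit))
```

With this class, the session above reproduces exactly: setting bits 0 and 1 makes the first word 3, and setting bit 63 puts 32768 (the high bit) in the last word.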

For our spell checker we create our million-bit (1048576) map as follows.

>>> import bitmap
>>> b = bitmap.Bitmap(2**20)
>>> b
<bitmap.Bitmap object at 0xb724decc>
>>> len(b.map)
65536

The last call to len shows that we are using 65536 (64K) 16-bit words, or 128K bytes, to hold our bitmap of 1048576 bits. At the time this was roughly 8% of the 1.5 megabytes of memory in our PDP-11/70.

Getting our Hash Functions

We had studied possible hash functions fairly carefully to make sure that each would give even coverage across the bitmap and that the functions were independent of each other. But for this simulation I'm going to simply use the hashlib module's md5 function to generate a nice 32-digit hexadecimal number (in string form) from any input string. I suspect it is just fine, especially for this demonstration.

>>> import hashlib
>>> hashlib.md5('zygotic'.encode()).hexdigest()
'2368434ae5c0612e153f3a37283767bf'

It takes 5 hexadecimal digits to address 2**20 bits (why?). If we slice the first 30 digits into 6 five-digit chunks then we have 6 independent hash numbers to use.

>>> b = '2368434ae5c0612e153f3a37283767bf'
>>> int(b[0:5],16)       # Convert string to integer as hex (base 16)
145028
>>> int(b[5:10],16)
215781
>>> int(b[10:15],16)
787986
>>> int(b[15:20],16)
922943
>>> int(b[20:25],16)
238450
>>> int(b[25:30],16)
538471

You can see that the bits that are to be set for "zygotic" are all in the appropriate range.

A Simple Spell Checker

Our program spellCheck.py pulls all of these ideas together. It first creates and populates a bitmap with all 53,751 words (that takes a couple of seconds). It then reads text from standard input. The text is first stripped of punctuation and converted to lowercase. Then each word is spell checked and written to standard output along with the result. The logic of the program should be straightforward.
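
spellCheck.py itself is not reproduced here, but its core logic might be sketched as follows. The function names and the inlined bitmap are my own, for illustration only; the real program uses the Bitmap class from bitmap.py and loads its lexicon from the spell.words file.

```python
import hashlib

NBITS = 2**20                        # a 1048576-bit map
bitmap = [0] * (NBITS // 16)         # inlined list of 16-bit words

def set_bit(n):
    bitmap[n // 16] |= 1 << (n % 16)

def get_bit(n):
    return bool(bitmap[n // 16] & (1 << (n % 16)))

def hashes(word):
    """Six independent 20-bit hashes from one md5 digest."""
    digest = hashlib.md5(word.encode()).hexdigest()
    # Slice the first 30 hex digits into six 5-digit chunks;
    # each chunk, read as hex, is a number in range(2**20).
    return [int(digest[i:i+5], 16) for i in range(0, 30, 5)]

def insert(word):
    for h in hashes(word):
        set_bit(h)

def check(word):
    # All six bits must be set for the word to pass.
    return all(get_bit(h) for h in hashes(word))

# Populate with a tiny stand-in lexicon and test a few words.
for w in ["now", "is", "the", "time", "zygotic"]:
    insert(w)

print(check("zygotic"))   # True
print(check("eeaten"))    # False, barring an (unlikely) false positive
```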

This is our test file

$ cat test.text
Now is the time for the zygotic zwieback to be eeaten.
Don't do it in an air-conditioned settting, ok?

$ python spellCheck.py spell.words <test.text
True   now
True   is
True   the
True   time
True   for
True   the
True   zygotic
True   zwieback
True   to
True   be
False  eeaten
True   don't
True   do
True   it
True   in
True   an
True   air-conditioned
False  settting
True   ok

Comments

In our original implementation we used only 4 hash functions and only 32768 words of memory. Populating the bitmap with the lexicon, which causes the two-second pause above, took about a minute back then. That's only about thirty times slower, not a thousand, because we gain roughly a 30-to-1 speed advantage by using C or assembler instead of Python. Also, I'm not sure how expensive (in time) the md5 function is.

The original program worked as a server program (at the time called a "spooler"). The bitmap with the lexicon loaded resided as a file on the disk. This could be loaded in a second or two. But it was difficult to change the lexicon on the fly. Deleting a mistaken word meant rebuilding the entire bitmap. (why?)

For its time, the little program was a dream come true and it spread from our newspaper to about 30 others using the same computer system for typesetting. But within 10-15 years much better spell checking was available on other systems built on more powerful hardware. These could offer suggestions for corrections. Ours could not. (why?)

To see how far we have come since then, take a look at the Coursera course Natural Language Processing from Stanford Online. At the time of writing it was not available as a course, but the videos could be watched as a preview. The second half of week two is devoted to spelling correction.

In the near future I want to do a chapter on the Soundex phonetic algorithm with, of course, a small Python implementation. That can provide a rudimentary way to offer possible corrections to misspelled words.

One last thing. The lexicon is quite old; the last update might have been around 1990, so some words are missing. Especially obvious are "email" and "internet".

Copyright © 2014-2015 Chris Meyers
