Text Processing: Soundex for Spell Checking

The Problem

Earlier we developed a spell checker and here we'll look at the soundex algorithm to see if it might be useful in offering corrections for misspelled words.

In my email program a misspelt word is underlined with a wavy red line and it was only fairly recently that I discovered that a right-click would show alternative possibilities.

In all fairness soundex was not made for this purpose. It was used for census tracking and later for geneolgy. It is simple enough to calculate the soundex code for any word by hand. With proper names it is surprisingly on target.

Some Examples

The program soundex.py will convert words on the command line to their coded form. This coded form is always four characters long, begins with the first letter of the word and followed by 3 digits. Here are some examples of proper names for people.

$ python soundex.py shultz schultz
  shultz  ('S432', 'sltz')
  schultz ('S432', 'sltz')

$ python soundex.py brigita brigitte
  brigita  ('B623', 'brgt')
  brigitte ('B623', 'brgt')

$ python soundex.py stefan stephan
  stefan  ('S315', 'stfn')
  stephan ('S315', 'stpn')

$ python soundex.py christopher christian
  christopher ('C623', 'crst')
  christian   ('C623', 'crst')

$ python soundex.py allen alan alain alun
  allen ('A450', 'aln')
  alan  ('A450', 'aln')
  alain ('A450', 'aln')
  alun  ('A450', 'aln')

$ python soundex.py smith smythe
  smith  ('S530', 'smt')
  smythe ('S530', 'smt')

The soundex algorithm is based on 4 ideas.

The first idea is knowing what to ignore. Unless they start a word, the vowels A-E-I-O-U along with the vowel-like or sometimes silent letters W-H-Y are just ignored.

Second idea. Convert any repeated consonates to just a single letter. So "mm" becomes "m", "ll" becomes "l", etc.

Third idea. Assign each remaining consonant (the vowels are gone) to one of 6 groups. The groups consist of the following letters.

Group 1:  b f p v
Group 2:  c g j k q s x z
Group 3:  d t
Group 4:  l
Group 5:  m n
Group 6:  r

The letters in each group have somewhat similar sounds. Sometimes (d, t) the tongue action is the same. The difference is whether the vocal cords are activated. This is also true of pairs in group 2. (g/k, z/s). The th combination actually ends up in group 3 since the h is dropped. The letter c conveniently is in group 2 so whether it has a k sound or s (or even z) it is properly handled. "ph" nicely becomes the equivalent of "f". And so on.

Last idea. Keep the resulting codes to a fixed length. Only the first 4 letters kept are used and if there are fewer the code is padded to a length of 4 using '0'.

With all of that all of the examples above should make perfect sense. Let's diagnose one.

schultz ('S432', 'sltz')

So the code is S432 and is formed from the letters sltz. The S is kept as the first letter. The c is not because it's in the same group (2) as s. The h is not because it's in the W-H-Y group. The u is not because it is a vowel. The l is kept (group 4). The t from group 3. And finally the z (group 2) is kept. So, there we have the code S432.

It's important to remember that Soundex was designed only for the English language with its rather quirky way of spelling words and names. Even so, it can often fail, for example 'gh' is sometimes pronounced as an 'f' and sometimes 'gh' is silent. So here things break down a bit.

$ python soundex.py tough tuff
  tough ('T200', 'tg')
  tuff  ('T100', 'tf')
$ python soundex.py dough doe
  dough ('D200', 'dg')
  doe   ('D000', 'd')

Soundex'ing our Lexicon

The test function in soundex.py will encode words on the command line if there are any. If there are not, it reads text from standard input. So we can do the following

$ python soundex.py
  Is it really 5 o'clock in Brazil?
  I200 is   is
  I300 it   it
  R400 rl   really
  O242 oclc o'clock
  I500 in   in
  B624 brzl brazil
  ^D                 (end of file character)

So, we can feed the lexicon from the spell checking project into soundex with the following

$ python soundex.py <spell.words >foo.1
$ sed "s/ .*//" <foo.1 >foo.2  #keep just 1st column
$ sort foo.2 | uniq -c | more
  6 C164
  9 C310
  6 C312
  2 C313
  7 C314

$ sort foo.2 | uniq -c | wc
  3282    6564   42666

It turns out that there are 3282 codes for the 53751 entries in the lexicon. This could also be determined in a single step by

$ python soundex.py <spell.words |sed "s/ .*//" | sort | uniq -c | wc
--  or --
$ cat spell.words | python soundex.py |sed "s/ .*//" | sort | uniq -c | wc

Soundex as part of a Spelling Checker?

By itself Soundex would not be that useful. You can see that by the following. I am often spelling "grammar" as "grammer" and "choice" as "choise" (probably because of "choose").

$ python soundex.py grammer grammar choice choise
  grammer ('G656', 'grmr')
  grammar ('G656', 'grmr')
  choice  ('C000', 'c')
  choise  ('C000', 'c')

Things start look less promising when we do

$ grep C000 foo.1 |wc
  170     510    2751

and see that words like "cushy" and "cuss" and 168 more also code to "C000". Replace the "wc" (wordcount) with "more" and you'll see exactly why.

There is additional possibility I would like to explore. There is a feature called the "Edit Distance" between two strings, say, between a misspelt word and its correct spelling. It's basically the number of changes (inserts and deletes) it takes to convert one string to another. The algorithm is a nice example of dynamic programming. Perhaps words within the same soundex group could be sorted by their edit distance from the misspelt word.

All files for the project are in soundex.zip

If you have comments or suggestions You can email me at mail me

* * *