(LONG) Writing a spelling checker -- advice wanted

Pieter Schoenmakers tiggr at acorn.co.uk
Sun Feb 4 15:54:36 AEST 1990


This is a *LONG* posting, about the design of a spelling checker I have been
working on.  If you are a writer/maintainer of these beasties please try
to read this as I would appreciate your collective advice.  If you are
a user of spelling checkers and have strong views on *features* needed, again
please let me know.  BUT I'm not all that interested in the details of
user-interfaces; if you were to read on you would see why.

Excuse any spelling mistakes -- I'm visiting a friend today & don't have
a checker to hand :-)

Graham  <gtoal at ed.ac.uk>

============================================================================
Graham Toal              Design notes on spelling checker             03/02/90

   This document is an electronic notebook of ideas/reminders for my current
spelling-checker project.  Don't expect coherence...  that comes later :-)

1)  Goal:
      to produce a set of C procedures which form the kernel of a spelling
    checker.

    It is my intention that these procedures will be the last word in fast
    spelling checking.  They will be made available to anyone writing a
    checker for inclusion into their user interface.  I also want to
    define a standard form of dictionary for access by these procedures
    which will be portable between machines.

-----------------------------------------------------------------------------

[The description which follows of the data structure is really poor.  I
should construct *real* examples and give quotes from the various papers;
I haven't done so for the laziest of reasons.  I may do so later if
asked.  If you already know of Appel's/Liang's work you'll probably follow
this otherwise it might be worth skipping on first reading.]


The data structure used by this system is a Directed Acyclic Graph.  The only
reference I have found to previous uses of this structure is [Appel1] who uses
it in his Scrabble(tm) game.  In this paper he calls the structure a DAWG
(Directed Acyclic Word Graph) so I shall use this terminology too. 

The easiest way to represent a dawg is with a Trie [Knuth1] which is a tree
with a branching ratio greater than 2. 

The simplest form of Trie is the Linked Trie, where both branches to a new
level, and options at the same level are connected with pointers:

(words: act, bat, fat, me, my)


ROOT:
0:    ['a', 62, 1224 ]    (  --->  branches are horizontal  )
      |
      | (a node which is a choice at the same level is depicted vertically)
      |
                 1224: ['c', (end), 3346] 
                                    3346: ['t', (end), (end) ]
62:   ['b', 115, 6551 ]
      |          6551: ['a', (end), 3346] ('t' is shared with the end of act)
      |
115:  ['f', 232,  6551 ] ('at' is shared with the end of 'bat')
      |                       
232:  ['m', (end), 10238 ]
                   10238: ['e', 10564, (end)]
                   |
                   |
                   10564: ['y', (end), (end)]

Appel's DAG is stored as a Trie, with each set of branches at any level stored
sequentially:

----> [  'a'  -> 1224 ]
      [  'b'  -> 6551 ]
      [  'f'  -> 9960 ]
      [  'm'  -> 10238 (END OF NODE) ]

In this style of Trie, each set of branches is stored sequentially and the last
has a flag bit set to terminate the list. 
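
For concreteness, here is a minimal sketch of looking a word up in such a
sequentially-stored node.  The entry layout (letter, end-of-node flag,
end-of-word flag and child index packed into one 32-bit word) is my own
assumption for illustration, not a settled file format:

/* Hypothetical entry layout: low 8 bits = letter, bit 8 = last entry in
 * this node, bit 9 = a word ends on this edge, remaining bits = index of
 * the child node (0 = no child).  Node 0 is reserved as "nothing". */
#define LETTER(e)       ((char)((e) & 0xFF))
#define LAST_IN_NODE(e) (((e) >> 8) & 1)
#define END_OF_WORD(e)  (((e) >> 9) & 1)
#define CHILD(e)        ((e) >> 10)

/* Return non-zero if 'word' is in the lexicon held in dawg[]. */
int dawg_lookup(const unsigned long *dawg, unsigned long root,
                const char *word)
{
    unsigned long i = root;

    for (; *word != '\0'; word++) {
        for (;;) {                      /* scan this node's branch list */
            if (LETTER(dawg[i]) == *word)
                break;
            if (LAST_IN_NODE(dawg[i]))
                return 0;               /* letter absent from this node */
            i++;
        }
        if (word[1] == '\0')
            return END_OF_WORD(dawg[i]);
        i = CHILD(dawg[i]);             /* descend one level */
        if (i == 0)
            return 0;                   /* nothing follows this letter */
    }
    return 0;                           /* empty string: not a word */
}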

There is an alternative form of Trie which is an indexed trie:

----> [  'a'  -> 1224 ]
      [  'b'  -> 6551 ]
      [  'c'          ]
      [  'd'          ]
      [  'e'          ]
      [  'f'  -> 9960 ]
      [  'g'          ]
              :
              :
              :
      [  'l'           ]
      [  'm'  -> 10238 ]
      [  'n'           ]
      [  'o'           ]
              :
              :
              :
      [  'z'           ]

In this form, all <N> options (where <N> is the branching ratio) are stored, so
searching such a structure is fast: each step is a direct indexing operation.
However, this comes at the expense of space. 

The data structure I have chosen is indexed as above, but the spaces in each
branching node are used by overlapping several such nodes in store.  This
structure is not obvious (It needs a better diagram which I can't be bothered
typing just now) but is fully described by Liang [Liang1] in his paper on
Hyphenation.  He uses a packing algorithm to save space in a trie used for
hyphenating words in the TeX typesetting language. 
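
By way of illustration, lookup in the packed form might look like the sketch
below.  The cell layout (a 'check' letter recording which letter a cell was
filled under, an end-of-word flag and the child's base index) is again my own
assumption; the check field is what lets overlapped nodes share space safely.

struct cell {
    char check;             /* letter this cell was filled under; 0 = free */
    char ends_word;         /* non-zero if a word ends on this transition  */
    unsigned long next;     /* base index of the child node; 0 = none      */
};

/* Return non-zero if 'word' (lower-case a-z) is in the packed trie. */
int packed_lookup(const struct cell *trie, unsigned long base,
                  const char *word)
{
    for (; *word != '\0'; word++) {
        unsigned long i = base + (unsigned long)(*word - 'a');
        if (trie[i].check != *word)
            return 0;                   /* slot belongs to another node */
        if (word[1] == '\0')
            return trie[i].ends_word;
        base = trie[i].next;            /* direct index, no scanning */
        if (base == 0)
            return 0;
    }
    return 0;
}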

----------------------

Appel's DAWG and Liang's packed tries have their own advantages.  Both take up
approximately equal space: Appel's is better if you want to extract the list of
words from the data structure, because you can walk the tree cheaply; however
looking up a known word is slower because you have to sequentially scan each
node.  Liang's is better for looking up a word because it is a straight index:
one operation for each character in the word.  However walking the tree is
expensive because you have to scan all <n> (the branching ratio -- 26 in this
case) pointers for each node, checking each one to see whether it is valid or
not. 

I intend to support both styles.  The lookup code for both is very short, so
there is no significant overhead once the choice of lookup procedure has been
made.  (It can be stored in a procedure variable).  The utility which generates
the DAWG from a list of words generates an Appel DAWG; the Liang packed trie
can be looked on as an optimisation.  A proof-of-concept implementation has
shown the latter makes a spelling checker run approximately three times as
quickly. 
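
To make the trade-off concrete, here is a sketch of the cheap walk over an
Appel-style DAWG, reusing the hypothetical entry layout from the earlier
lookup sketch; the equivalent walk over a packed trie has to probe all 26
slots of every node and reject the ones belonging to overlapped neighbours.

#include <stdio.h>

#define LETTER(e)       ((char)((e) & 0xFF))
#define LAST_IN_NODE(e) (((e) >> 8) & 1)
#define END_OF_WORD(e)  (((e) >> 9) & 1)
#define CHILD(e)        ((e) >> 10)

/* Print every word in the DAWG by depth-first walk from 'node'. */
static void dawg_walk(const unsigned long *dawg, unsigned long node,
                      char *buf, int depth)
{
    unsigned long i;

    if (node == 0)
        return;
    for (i = node; ; i++) {             /* each entry is visited exactly once */
        buf[depth] = LETTER(dawg[i]);
        if (END_OF_WORD(dawg[i])) {
            buf[depth + 1] = '\0';
            puts(buf);                  /* a complete word ends here */
        }
        dawg_walk(dawg, CHILD(dawg[i]), buf, depth + 1);
        if (LAST_IN_NODE(dawg[i]))
            break;
    }
}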

---------------------------------------------------------------------------

The trie structure is not optimal -- because the data being stored are words in
the English language (or others) there is a lot of redundancy of which we can
take advantage.  A trie structure saves space because common initial strings
are merged.  Some spelling systems also merge common tail strings (by explicit
affix analysis and tokenization).  We, however, intend to take advantage of
*all* common text -- start, middle, and end.  We do this by converting the trie
to a dag, which is fully state-minimized.  When building the trie, we compare
each node to every other node already created, and when we find two the same we
merge them and all pointers to them.  The comparison is done by hashing each
node as it is created.  The hash table is the same size as the space allocated
for storing the nodes themselves. 

The DAG can be viewed as a finite-state machine recognising the grammar defined
by the set of words in the lexicon. 
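
A sketch of that merge step: nodes are registered bottom-up (children before
parents, which the sorted input makes possible), each completed node is hashed
over its edge words, and an identical node already in the table is simply
reused.  All names, sizes and the hash function below are illustrative, and
bounds checks are omitted.

#include <string.h>

#define MAX_NODES  50000
#define HASH_SIZE  65536            /* comparable in size to the node store */
#define MAX_EDGES  27

static unsigned long node_store[MAX_NODES][MAX_EDGES];
static int           node_len[MAX_NODES];
static unsigned long hash_table[HASH_SIZE];   /* 0 = empty slot */
static unsigned long next_node = 1;           /* index 0 reserved */

/* Register a completed trie node (its outgoing edges); if an identical
 * node already exists, return its index so every parent shares it. */
unsigned long register_node(const unsigned long *edges, int nedges)
{
    unsigned long h = 0, cand;
    int i;

    for (i = 0; i < nedges; i++)
        h = h * 31u + edges[i];            /* cheap structural hash */
    h %= HASH_SIZE;

    while ((cand = hash_table[h]) != 0) {  /* open-addressing probe */
        if (node_len[cand] == nedges &&
            memcmp(node_store[cand], edges,
                   nedges * sizeof edges[0]) == 0)
            return cand;                   /* identical node: merge */
        h = (h + 1) % HASH_SIZE;
    }
    memcpy(node_store[next_node], edges, nedges * sizeof edges[0]);
    node_len[next_node] = nedges;
    hash_table[h] = next_node;
    return next_node++;
}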

----------------------------------------------------------------------------

Converting from an Appel dawg to a Liang packed overlapping trie *could* use RAM
space of twice the size of the dawg - one input array & one output array;
however tests have shown that the first-fit packing heuristic is so good that
one array can be used for both input & output data: If a space of (say) 4K is
allocated immediately in front of the input array, the packing can proceed down
the input array sequentially writing to an output area approximately 4K behind
itself.  The exact size of this gap may vary, but it never catches up. 
High-water/Low-water marks are kept for both arrays to check the small
probability that they do overlap; this has never happened, and recompiling with
a larger gap will avoid the problem.  (Or it could even be a run-time option)
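
A sketch of the first-fit placement itself, on a simplified cell layout of my
own (just a check letter and a target index); bounds checks and the
sliding-gap trick are left out, and a real packer would remember the lowest
free cell instead of always scanning from zero.

#define PACKED_SIZE  200000

static char          check[PACKED_SIZE];    /* 0 = cell still free */
static unsigned long target[PACKED_SIZE];

/* Pack one sparse 26-way node (children[c] == 0 means letter absent)
 * at the lowest base where none of its occupied slots collide.
 * Returns the base index at which the node was placed. */
unsigned long pack_node(const unsigned long children[26])
{
    unsigned long base;
    int c;

    for (base = 0; ; base++) {              /* first fit: lowest legal base */
        for (c = 0; c < 26; c++)
            if (children[c] != 0 && check[base + c] != 0)
                break;                      /* collision, try the next base */
        if (c == 26)
            break;                          /* every needed slot is free */
    }
    for (c = 0; c < 26; c++)
        if (children[c] != 0) {
            check[base + c]  = (char)('a' + c);
            target[base + c] = children[c];
        }
    return base;
}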

----------------------------------------------------------------------------

Implementation details of DAWG files:

The data structure is prepared once from a file of words (which has to be
sorted with this algorithm; a version which allowed arbitrary ordering was
O(n^2) because of the time taken to coalesce nodes when creating the DAG from
the Trie.  The already-sorted version can do this on the fly) and is loaded
into memory by the checker. 

The file contains a sequence of *words*, not bytes.  Each word is an *index*
pointing to the next node, with the index relative to the start of data.  The
intention is that a C int array is mapped directly on to this layout. 

(I believe the word-file concept will mean that a DEC10 36 bit implementation
will not be in the least problematic; perhaps won't even need any source
changes?)

On systems with real memory, the whole file would be loaded into memory with
the fastest possible I/O primitives.  On systems with virtual memory,
performance advantages can be had by directly connecting the file in
read-shared mode, so that multiple users share the same physical pages and take
advantage of the VM caching. 

The file will have a word-long header which can be used to determine the
byte-sex which the dawg was saved under.  If this is not the same as the
machine on which it is being run, the checker will warn & reverse the words in
store.  This will allow the same file to be used on a network, where most
machines use one byte-sex and a few use the other.  It seems wasteful to have
two versions of the file.  The reversal is quick and done only once. 
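
A sketch of the check at load time.  The header value 0x01020304 is purely
hypothetical (the real magic/header words are still to be decided, see below),
and 32-bit file words are assumed.

#define SEX_MARK   0x01020304UL
#define SEX_MARK_R 0x04030201UL

static unsigned long swap32(unsigned long w)
{
    return ((w & 0x000000FFUL) << 24) | ((w & 0x0000FF00UL) << 8) |
           ((w & 0x00FF0000UL) >> 8)  | ((w & 0xFF000000UL) >> 24);
}

/* Returns 0 on success, -1 if the file is not a dictionary at all.
 * The reversal is done once, in place, as described above. */
int fix_byte_sex(unsigned long *words, unsigned long nwords)
{
    unsigned long i;

    if (words[0] == SEX_MARK)
        return 0;                       /* native order: nothing to do */
    if (words[0] != SEX_MARK_R)
        return -1;                      /* unrecognised header */
    for (i = 0; i < nwords; i++)
        words[i] = swap32(words[i]);
    return 0;
}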

The file will also have a magic number to identify it; one magic number will
mean Appel-style dawg & another Liang-style packed trie. 

It will have some string identifiers.  I haven't yet decided exactly how these
will be stored, but some sort of indirection is needed.  The file must contain:
what character set is used (eight-bit data is acceptable -- the only banned
character is NUL); and the language of the words in the file; and a descriptive
text string for such things as copyright & subject description (eg specialised
lists for biology etc)

It may contain also: a vector for soundex reduction (see section on spelling
correction) because different languages need different rules; a pointer to a
hyphenation database (TeX's algorithm) and a more complex phonetic database for
higher-level spelling correction (also explained elsewhere). 

-----------------------------------------------------------------------------

Coding details:

    This code is intended to be maximally portable.  No fancy features of C
will be used; I'll try to avoid procedure variables if I can, for instance. 
ANSI headers will be present if the compiler supports them. 

There will be a header file "grope.h" which tests all the predefined macros of
the compiler and depending on what it finds will set various preprocessor
macros.  For instance, instead of testing __ANSI_C__ or MICROSOFT etc, to see
whether prototypes are valid, it will test for #ifdef PROTOTYPES.  Similar
flags will be set if VOID is understood by the compiler etc. 
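
A sketch of what grope.h might look like; the compiler macros examined here
(__STDC__, _MSC_VER) are only examples of the idea, and a real grope.h would
test many more.

/* grope.h -- sniff the compiler once, then let the rest of the code test
 * only our own flags (PROTOTYPES, HAS_VOID, ...). */

#ifndef GROPE_H
#define GROPE_H

#if defined(__STDC__) || defined(_MSC_VER)
#define PROTOTYPES 1            /* function prototypes are understood */
#define HAS_VOID   1
#endif

#ifdef PROTOTYPES
#define ARGS(x) x               /* int lookup ARGS((const char *word)); */
#else
#define ARGS(x) ()              /* old-style declaration for K&R compilers */
#endif

#endif /* GROPE_H */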

The main flag affecting code generation is MEMORY_MODELS, which if set means
that the code is being compiled for a brain-dead CPU like the Intel 8086, which
doesn't allow arrays of > 64K.  This affects the DAG-generating program, and
causes it to generate the DAG in many chunks - one for each top-level letter
(ie 26, except it may be more because of international characters) instead of
in one go.  Field tests show that 64K is enough to generate the compressed DAG
for all the words starting with any particular letter in English. 

The code will be indented/formatted consistently and in such a way that tools
like 'indent' can reformat it to your preferred style. 

Initial testing showed up problems such as names like 'entry', which is not
reserved in ANSI C but *is* reserved in K&R C.  (This was spotted on the
Atari.)  To avoid this, all external variables will have a unique prefix. 

----------------------------------------------------------------------------

Spelling correction:

Typos caused by transposition, & single insertion/deletion or replacement are
generated from the word, and then checked in the dictionary for validity. 
These are presented to the user as a first choice. 
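
A sketch of that candidate generation, with the dictionary test and the
'offer this word' action passed in as hooks (both names are hypothetical):

#include <string.h>

/* Generate every single-error variant of 'word' (transposition, deletion,
 * insertion, replacement) and offer the ones the dictionary accepts. */
void typo_candidates(const char *word,
                     int (*valid)(const char *),
                     void (*offer)(const char *))
{
    char buf[64];
    int len = (int)strlen(word), i;
    char c;

    if (len + 2 > (int)sizeof buf)
        return;

    for (i = 0; i + 1 < len; i++) {             /* transpositions */
        strcpy(buf, word);
        buf[i] = word[i + 1]; buf[i + 1] = word[i];
        if (strcmp(buf, word) != 0 && valid(buf)) offer(buf);
    }
    for (i = 0; i < len; i++) {                 /* deletions */
        memcpy(buf, word, i);
        strcpy(buf + i, word + i + 1);
        if (valid(buf)) offer(buf);
    }
    for (i = 0; i <= len; i++)                  /* insertions */
        for (c = 'a'; c <= 'z'; c++) {
            memcpy(buf, word, i);
            buf[i] = c;
            strcpy(buf + i + 1, word + i);
            if (valid(buf)) offer(buf);
        }
    for (i = 0; i < len; i++)                   /* replacements */
        for (c = 'a'; c <= 'z'; c++) {
            strcpy(buf, word);
            if (buf[i] == c) continue;
            buf[i] = c;
            if (valid(buf)) offer(buf);
        }
}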

   Next, a wrong word is reduced to a canonical form.  Then the word tree is
walked, with each word being converted to the same form on the fly; these forms
are compared and matches are possible candidates for correction. 

The first algorithm for canonical form conversion is soundex. 
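
For reference, a minimal sketch of the classic Soundex reduction (the usual
4-character code; the finer points of 'h'/'w' handling are glossed over):

#include <ctype.h>

static const char soundex_code[26] = {
    /*  a    b    c    d    e    f    g    h    i    j    k    l    m  */
        0,  '1', '2', '3',  0,  '1', '2',  0,   0,  '2', '2', '4', '5',
    /*  n    o    p    q    r    s    t    u    v    w    x    y    z  */
       '5',  0,  '1', '2', '6', '2', '3',  0,  '1',  0,  '2',  0,  '2'
};

/* Reduce 'word' to its Soundex code in out[] (e.g. "robert" -> "R163"). */
void soundex(const char *word, char out[5])
{
    int len = 0;
    char prev = 0;

    for (; *word && len < 4; word++) {
        int c = tolower((unsigned char)*word);
        char code;

        if (!islower(c))
            continue;                       /* skip non-letters */
        code = soundex_code[c - 'a'];
        if (len == 0)
            out[len++] = (char)toupper(c);  /* keep the first letter */
        else if (code != 0 && code != prev)
            out[len++] = code;              /* drop vowels, collapse runs */
        prev = code;
    }
    while (len < 4)
        out[len++] = '0';                   /* pad to four characters */
    out[4] = '\0';
}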

The list of words generated by this method is then evaluated by a function
which does a 'fuzzy compare' against the original and returns a value which is
a distance metric between the original and the suggestion.  The list of
words+values are then sorted, and the best <n> are presented in order of
nearness.  Some hacky heuristics are useful here to prune the lists at suitable
places, eg with target word = faero, suggestions of -- faro (91), faerie (87),
pharoe (78), furry (27), fury (19) -- we would note the disproportionate drop
between pharoe & furry and trim the list at that point.  (The heuristic being
if the value of one word is less than 50% of the next higher word)

I have several different 'fuzzy compare' algorithms.  I may include more than
one in the code & take (say) the best 4 words of each algorithm and merge the
results.  (They don't always give the same ordering)

Typos should also be ordered by distance metric; I believe Damerau &
Levenshtein did work on this; I haven't seen the paper.  Such a metric should
take into account distance between keys on a typewriter, doubled letters being
more likely than random insertions, etc. 
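
The Damerau/Levenshtein distance is a dynamic-programming edit count; a sketch
of the restricted (adjacent-transposition) form follows, with sub_cost() as
the obvious hook for keyboard-distance or doubled-letter weightings.  The
uniform costs here are placeholders, not a tuned metric.

#include <string.h>

#define MAXWORD 64

static int sub_cost(char a, char b)
{
    return a == b ? 0 : 1;     /* placeholder: refine with key distance */
}

static int min3(int a, int b, int c)
{
    if (b < a) a = b;
    return c < a ? c : a;
}

/* Edit distance between s and t, counting insert, delete, replace and
 * adjacent transposition; returns -1 if either word is too long. */
int edit_distance(const char *s, const char *t)
{
    static int d[MAXWORD + 1][MAXWORD + 1];
    int n = (int)strlen(s), m = (int)strlen(t), i, j;

    if (n > MAXWORD || m > MAXWORD)
        return -1;
    for (i = 0; i <= n; i++) d[i][0] = i;
    for (j = 0; j <= m; j++) d[0][j] = j;

    for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++) {
            d[i][j] = min3(d[i - 1][j] + 1,                    /* delete  */
                           d[i][j - 1] + 1,                    /* insert  */
                           d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]));
            if (i > 1 && j > 1 && s[i - 1] == t[j - 2]
                               && s[i - 2] == t[j - 1]
                               && d[i - 2][j - 2] + 1 < d[i][j])
                d[i][j] = d[i - 2][j - 2] + 1;                 /* transpose */
        }
    return d[n][m];
}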

----------------------------------------------------------------------------

What I *really* want to experiment with is an algorithm of my own: I want to
convert the word to a highly-accurate phonetic representation (a bit like a
text-to-speech phonetic string) and compare these.  The algorithm doesn't need
to be as good as true text-to-speech; but it has to be much better than soundex
which gives far too many false drops. 

The main problem is with digraphs such as 'ph' - when is it possible to
convert these to single letter representations? Take a word like 'haphazerd'
for instance - if converted to 'hafazerd' (as one text-to-speech system I heard
did!) it would never find a match with 'haphazard' (Well, it would because that
would be mangled too :-) ...  but phoenix vs feenix would not be caught)

So the problem is to spot chance occurrences of letters which would otherwise
form digraphs.  Well, the answer is easy! Split the word up using TeX's
hyphenation rules (haphazard -> hap-ha-zard), reduce that phonetically (with a
set of rules like 'ph -> f') and compare these results.  I think it will work &
I'm looking forward to having time to try it. 

One final tweak: this may be expensive, but I want to try a distance-metric
comparison function which works like the unix 'diff' program: it will give me
an edit command to turn one word into another; if I score the number of
insertions/deletions & weight by size of inserted/deleted text, then I might
get a very good distance metric indeed.  (Especially if the alg is like the
CACM diff which understands moving & replacements; a quick phonetic similarity
check on replacements would refine the metric even more). 

A quick & dirty hack of diff - by taking two words and writing them out to two
files, then diffing the files - shows that this works! An improvement is to
write out digraphs on one line instead of two; this also works!

For example:

word 1 = phoenix ->
ph
oe
n
i
x
word 2 = feenix ->
f
ee
n
i
x

diff returns something like
/ph/f/     Phonetic similarity = 90%
/oe/ee/    Phonetic similarity = 85%
Score = 100 * 90% * 85%

Implementation of similarity test:  both the keyboard layout similarity
and the phonetic one can be implemented by a matrix of input values down
the side and output values across the top; digraphs will have to be
converted to a single indexing code.  These matrices could be stored in
the lexicon file perhaps.  Sparse storage techniques might be needed --
Liang's trie packing again? (ought to work)
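
A sketch of that matrix lookup, with digraphs folded to single codes first;
the code numbering and the couple of table entries shown are purely
illustrative, not a real phonetic model.

#define NCODES 40              /* 26 letters plus a handful of digraph codes */
#define CODE_PH 26
#define CODE_EE 27
#define CODE_OE 28

static unsigned char similarity[NCODES][NCODES];  /* 0..100 (%) */

static void init_similarity(void)
{
    int i;

    for (i = 0; i < NCODES; i++)
        similarity[i][i] = 100;                    /* identical sounds     */
    similarity[CODE_PH]['f' - 'a'] = 90;           /* ph ~ f  (example)    */
    similarity['f' - 'a'][CODE_PH] = 90;
    similarity[CODE_OE][CODE_EE] = 85;             /* oe ~ ee (example)    */
    similarity[CODE_EE][CODE_OE] = 85;
}

/* Combine per-segment similarities into one score, assuming diff has
 * already aligned the two words segment by segment, as in the example
 * above: Score = 100 * 90% * 85% and so on. */
int combine(const int *codes1, const int *codes2, int nsegs)
{
    long score = 100;
    int i;

    for (i = 0; i < nsegs; i++)
        score = score * similarity[codes1[i]][codes2[i]] / 100;
    return (int)score;
}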

----------------------------------------------------------------------------

This is really an aside: There was an algorithm published in Byte some
years ago for efficient multi-word anagram finding.  The dawg data
structure would be great for that.  It is also of course ideal for
Scrabble :-) And it's nice to have lexicons sitting around in a standard
format if you want to hack word games.  However, a practical use of the
anagram stuff would be to detect complex transpositions - sometimes 3
letters get cycled when your brain is going faster than your fingers!
(The ordinary typo correction code just catches single
pair-transpositions)

----------------------------------------------------------------------------


Dictionaries to cover:

    Proper names
    Place names
    Various languages
    Abbreviations
    Specialist disciplines (biology etc.)

I've started collecting these.  I have a pretty strictly vetted general
word list of about 690K of ASCII text, most UK telephone exchange names,
and all Edinburgh-registered usernames from the mail directory.  I have
a short French word list and a Scots Gaelic one.

----------------------------------------------------------------------------

UK/US & related issues:

    I have generated a large file containing words which are spelt differently
in the UK and US, such as traveled vs travelled etc. 

The file is of the form:

travelled=traveled
developped=developed
colour=color

There is a corresponding file with the words reversed: (no national chauvinism
here :-) )

traveled=travelled
developed=developped
color=colour

I think the spelling checker's main lexicon should include *both* uk & us
forms, but have a (possibly separate) utility or mode which runs a 'translation
pass' across the data, finding the text on the left hand side & replacing it
with the text on the right (or just flagging the word as an error - it's up to
the user interface). 

I have sets of files for each consistent UK/US difference; one set for pped ->
ped, one for our -> or etc.  I also have one for ize->ise and ise -> ize. 
(this being useful even in the UK to enforce house-style consistency)

   The user could thus customise his/her preferences by creating a merged file
from the bits & pieces above.  The mechanism could also be used to tidy up
things like pavement=sidewalk when exporting a document from the UK to the US
:-)

   The data structure for this is exactly the same as for spell checking - just
treat the '=' as another character; however the spell check routine has to stop
on '=' instead of end-of-string, and it has to return a string as the result. 
In fact it will be best to return a node index; this will mean that if there
are entries with multiple right-hand-sides, a list can be generated by walking
the tree to the right of the '='. 

For example, the US uses only one form, 'license', for
both noun and verb;  we might imagine entries with
         license=licence  and   license=license
so on finding 'license' the user would be offered both forms to choose from.
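
A sketch of the stop-on-'=' lookup described above, again on the hypothetical
entry layout used in the earlier sketches; it hands back the node to the right
of the '=' so the caller can walk out every right-hand side.

#define LETTER(e)       ((char)((e) & 0xFF))
#define LAST_IN_NODE(e) (((e) >> 8) & 1)
#define CHILD(e)        ((e) >> 10)

/* Returns the node index of the right-hand-side subtree for 'word',
 * or 0 if the word has no translation entry. */
unsigned long translate_lookup(const unsigned long *dawg,
                               unsigned long root, const char *word)
{
    unsigned long i = root;
    char want;

    for (;;) {
        want = *word ? *word : '=';     /* after the word, look for '=' */
        for (;;) {                      /* scan this node's branch list */
            if (LETTER(dawg[i]) == want)
                break;
            if (LAST_IN_NODE(dawg[i]))
                return 0;
            i++;
        }
        if (*word == '\0')
            return CHILD(dawg[i]);      /* subtree of right-hand sides */
        word++;
        i = CHILD(dawg[i]);
        if (i == 0)
            return 0;
    }
}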


----------------------------------------------------------------------------

A related issue is for foreign words, in languages like Dutch where words are
strung together:

                     Ziektekostenverzekering

It would be useful for a spelling checker to be able to split these words up
into individual words, and check those.  Such an algorithm wouldn't be perfect
but it would be better than nothing.  In fact, what you would do would be to
parse the compound word.  This might allow more than one valid parse, but we
don't care!

For instance in the word above, parsing from the left would give:

     Ziektekostenverzekering
     Ziek
         te
           kost
               en
                 ver
                    zeker
                         ing  <---- parse fails! No such word
                              ... so backtrack one level ...
                    zekering  <---- valid, so word passes.

In fact, the 'correct' parse for this would be:

    Ziekte
          kosten
                verzekering

Which *could* be discovered by evaluating all valid parsings and accepting the
one with the fewest discovered subwords. 
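
A sketch of that search, counting subwords by recursive backtracking; 'valid'
is the ordinary whole-word lookup hook, and no attempt is made here at Dutch
linking letters such as -s- or -en-.

#include <string.h>

/* Split a compound word into the fewest dictionary words.  Returns the
 * number of subwords in the best decomposition, or 0 if none exists. */
static int best_split(const char *w, int len,
                      int (*valid)(const char *))
{
    char head[64];
    int cut, rest, best = 0;

    if (len == 0)
        return 0;
    for (cut = len; cut >= 1; cut--) {       /* prefer long heads first */
        if (cut >= (int)sizeof head)
            continue;
        memcpy(head, w, cut);
        head[cut] = '\0';
        if (!valid(head))
            continue;
        if (cut == len)
            return 1;                        /* the whole thing is a word */
        rest = best_split(w + cut, len - cut, valid);
        if (rest != 0 && (best == 0 || rest + 1 < best))
            best = rest + 1;                 /* fewest subwords so far */
    }
    return best;
}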


Incidentally, if performing hyphenation on languages like Dutch, you have to
split the compounded words up before you apply the hyphenation rules.  This
might be relevant to the spelling code if I implement my overblown phonetic
spelling correction scheme.  (By the way, I've *already* implemented Dutch text
-> phonetic in prototype & it is *very* good!)


*** Note: compound-word decomposition is useful in English too: programmers
often string together words (eg filesystem) which in a report should really
be written as file-system.  The spelling correction algorithm will offer
such a hyphenation if the word is longer than a critical length and no
soundex replacement comes close enough.

----------------------------------------------------------------------------

Word isolation in input files:

   Strictly speaking, this isn't my problem; I don't really want to write
user interfaces to the spelling routines;  however a few short notes for
those hardy souls who'll take it on may be worth jotting down...

    Split documents up into sentences, so that initial words (which happen to
have capitals) can be differentiated from proper names which must have
capitals; this will allow you to use the correct lexicon for spelling
correction (proper names being looked up first, before ordinary words...)

   The dictionary will contain proper names *with correct capitalization*, so
they can be looked up verbatim first; then the word can be lower-cased and
looked up again if that fails. 

   Hyphenated words have to be discovered, and similarly words which contain
integral apostrophes -- such as cat-o'-nine-tails or Will-o'-the-wisp.  These
will be stored correctly in the lexicon.

   Possessives should have the 's or 'ses stripped -- but abbreviated forms
like haven't should be passed through properly, not as haven + t.

   Single letters will probably have to go through as being correct.

   Initials should be checked if possible;  OHMS, DOD, R.s.v.p - criteria
of recognition include all uppercase or letters separated by points.

   `Quoted strings' should not accidentally turn any quotation marks into
apostrophes.

   I haven't used yacc/lex myself, but I suspect they may be the best tools
for parsing input in this way.

   Word processor formats (TeX, roff etc) should be recognised in a
portable way - eg by defining a grammar of what constitutes a word-processor
directive (''ignore all alphabetic sequences preceded by backslash'')

   Words hyphenated in the input document because they happen to come at
the end of a line ought to be checkable properly:  first treat it as a
single word (laissezfaire for instance) and if that fails, check it
as a hyphenated word (laissez-faire).
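
That last check is simple enough to sketch, with the dictionary test passed in
as a hook:

#include <stdio.h>
#include <string.h>

/* A line-break hyphen: try the joined form first, then the explicitly
 * hyphenated form.  Returns non-zero if either form is acceptable. */
int check_linebreak_word(const char *first, const char *second,
                         int (*valid)(const char *))
{
    char buf[128];

    if (strlen(first) + strlen(second) + 2 > sizeof buf)
        return 0;
    sprintf(buf, "%s%s", first, second);     /* e.g. laissezfaire  */
    if (valid(buf))
        return 1;
    sprintf(buf, "%s-%s", first, second);    /* e.g. laissez-faire */
    return valid(buf);
}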

-----------------------------------------------------------------------------

User interfaces:

   again, only brief notes.

My favourite I'll leave to last.  It's a new one!  But first...

1) 'True' emacs style:

"filename", Line 234, Column 42:  wimble

or maybe if emacs' macro were expanded a little to handle replacements

"filename", Line 234, Column 42:  wimble -> womble, fubar -> foo-bar

2) uEmacs style:

(Don't have this to hand, but it's a more compressed representation)

3) Unix 'spell' style:

Sorted list of words

4) Similar to above, but with words in order of appearance.  Possibly
including duplicates.

5) Interactive query/replace  (I believe VMS has a good one like this)

6) 'Check-as-you-type' - goes beep when you type an error. (I *HATE* these!)

-----

Now my idea of the year ( :-) ) ...  I've actually implemented this on
my Acorn Archimedes so I know it works.

You hook the spelling checker into the console output stream.  It has a
state-machine following output, and whenever a word has been output it checks
the word.  If it fails the check, it deletes back over the word and re-writes
it in inverse video/colour/whatever.

Note that this is *NOT* check as you type -- it is check-as-the-computer-
outputs!  It means that you have automatic spelling checking added to your
own favourite editor, for instance.   If you page down in the editor, then
all errors on the newly visible page are already highlighted!  On my Archie
implementation, I can use a terminal emulator while spelling checking is
turned on, log on to a unix machine, 'cat' a file (at 9600 bd) and the file
flies by -- checked as it goes with almost zero overhead.  (Yes, the check
code is that fast.   I estimate 50K words per *second*.  The main overhead
is recognising/extracting the words -- the check is essentially free.)
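
A sketch of the output-side state machine; the backspace/ANSI-inverse
mechanism shown is just one possibility (on the Archimedes the hook sits on
the OS output vector instead), and a real version would also have to cope
with cursor motion and control sequences in the stream.

#include <ctype.h>
#include <stdio.h>

static char wordbuf[64];
static int  wordlen = 0;

extern int dict_valid(const char *word);   /* the checker itself */

/* Called for every character on its way to the screen. */
void output_hook(int ch)
{
    int i;

    if (isalpha(ch)) {
        if (wordlen < (int)sizeof wordbuf - 1)
            wordbuf[wordlen++] = (char)ch;  /* still inside a word */
    } else if (wordlen > 0) {               /* a word has just ended */
        wordbuf[wordlen] = '\0';
        if (!dict_valid(wordbuf)) {
            for (i = 0; i < wordlen; i++)   /* delete back over the word */
                putchar('\b');
            fputs("\033[7m", stdout);       /* inverse video on  */
            fputs(wordbuf, stdout);         /* rewrite it highlighted */
            fputs("\033[0m", stdout);       /* inverse video off */
        }
        wordlen = 0;
    }
    putchar(ch);                            /* pass the character through */
}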

On machines where it is not possible to intercept output routines via vectors,
it may be possible to add a level of system utility to achieve the same
effect - for instance Unix could do so by modifying 'script' (and using
'curses' for the highlighting?)

IBM PCs may be able to do it by continuously scanning the display memory
under interrupt. 

If the systems supports a mouse, it might be possible to pop up a mouse
menu when pressing over a highlighted word to offer corrections -- and
a suitable number of deletes would be simulated and the word pushed into
the type-in buffer.  (I haven't tried anything this advanced though :-) )

-----------------------------------------------------------------------------

I should really finish with an interesting point but it's about 4am & I've been
typing since just after midnight.  This started off as simply a brain-dump for
my own benefit, but I think I'll send it to usenet with a call for suggestions,
constructive criticism, and perhaps for volunteers!

These ideas have been germinating since I first looked at the problems
back in October, but pressure of work has laid them to one side.  Now I'd
like to get the project together & finish it soon.

If there has been anything in here you feel like commenting on, *PLEASE*
mail me.  All mail will be answered (networks permitting) and certainly
appreciated.  Don't clog up the net with large postings -- I've probably
used up my share of goodwill by posting this! (Especially if I have to
crosspost since I don't know the best forum to send it to!)

My mail address is NOT the same as the poster, whoever that turns out to
be; I can be contacted at:

                     Graham Toal <gtoal at ed.ac.uk>

Many thanks for wading through to the end.

PS If you were waiting for the references to papers -- sorry! Don't have
them here either...


