Storing substrings: some real C code in comp.lang.c!

T. William Wells bill at twwells.com
Mon Oct 30 04:55:36 AEST 1989


In article <6871 at hubcap.clemson.edu> hutch%citron.cs.clemson.edu at hubcap.clemson.edu writes:
: From article <1989Oct25.091519.10718 at twwells.com>, by bill at twwells.com (T. William Wells):
: > I have been beating my head against the wall, looking for speed,
: > *speed*, and MORE SPEED in this program I'm working on. At only
: > 800 WPM (on my 16MHz '386) it is not exactly a speed demon.
: > Especially since it has to be run on large lists of words, none of
: > which are smaller than 80,000 words.
: >
: > Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
: > bill at twwells.com
:
: First, forget the tree, and the hash.  What you need is a Finite State
: Automaton.  The basic idea for this can be found in the May 1988
: issue of CACM in an article on an implementation of SCRABBLE(tm).

My first implementation (a trie) is exactly what an FSA would
generate as output. The major cost in it is searching each level
of the tree. Knuth shows a way around that; it has been
implemented in the patgen program that came (comes?) with TeX.
There was also that Programming Pearls article in CACM. (Sorry,
exact references are not handy.)

However, those have the drawback that a test of each level of the
tree is a bit more complex than in mine. Also, the code to insert
into those trees is nontrivial. And slow. Yes, I've used these
algorithms.

: I've implemented a version of this and I can build the FSA in under
: a minute (for about 50000 words) on a sun-3 and under 5 minutes on
: a 80286.

In a recent test, my current implementation has the routine
called 454,094 times, 19,194 of which actually add new strings to
the table. It takes 24 seconds on my '386, which is about .9 VAX
MIPS. I think that beats your FSA.

BTW, for those interested, the hashing line in my original code
should be replaced with:

		hash += (hash << 6) + *ptr;

The hit rate goes down by a factor of 6!

:           If possible, you should build the FSA for the words in
: a separate processing step since it does take a bit of memory.

Not possible, since the strings have to be generated on the fly.
To do it otherwise would require generating a *huge* temp file.

: PS:  Since you are doing something confidential, I assume it is
: for *business*.

Not a good assumption, since it could be research someone doesn't
want yet bruited about, or for the government, etc., but in this
case, you are right.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill at twwells.com



More information about the Comp.lang.c mailing list