Hash??? Not quite clear on what this is...

Colin Plumb ccplumb at spurge.uwaterloo.ca
Sun Dec 16 11:14:05 AEST 1990


In article <9012031446.AA24102 at bisco.kodak.COM> bilbo at bisco.kodak.COM (Charles Tryon) writes:
>  Well, I never learned much about hashing when I went to school (more years
>  ago than I want to think about ;-).  I know the concept of hashing, but have
>  never had a use for it in the applications I have worked on.  My question
>  is, are there any recommendations out there for a good, fairly basic book on
>  hashing?  I don't need to know all the gory details and the latest hot-shot
>  methods, just the basics.

Hashing: taking a long key (often a string) and returning a short key (often
an integer in some range) that tries to be different for every long key.
This is obviously impossible in general, and you get "collisions" where
different long keys map to identical short keys.

For a good hash function, collisions are no more likely than if you assigned
short keys at random.  If you know all the input keys ahead of time, you can
work out a hash function that has no collisions, called a "perfect hash."  If
it also uses every short key (the N long keys hash exactly onto 0..N-1), it's
a "minimal perfect hash."

A common application of hash functions is hash tables, where you want to store
some records indexed by long key and get access to them quickly.  So you keep
an array of records and index it by the hash function to find the record you
want.  Collisions can be handled in various ways.  I like "chaining", where
each table entry is the head of a linked list and you do a linear search
along it (a minimal sketch appears below).  There's also extensible hashing,
where you copy everything to a bigger table and arrange things so there are
no collisions.  Then there's linear probing, useful for tables that never
need deletion: you search forwards from the hash position until you find an
empty slot.  If you're inserting, the new record goes in that empty slot.
If you're looking up, reaching an empty slot means the key was never stored.
Simple, but its performance gets Very Bad as the table fills up.

Then there's double hashing, where you search forwards in steps given by a
second hash function but otherwise proceed as in linear probing (this works
very well); uniform hashing, where the hash function produces a permutation
of the table buckets to search in (double hashing is a very close
approximation to it); and other tricks.
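
A minimal sketch of the double-hashing probe, assuming a prime table size so
that any nonzero step visits every slot; the names and the choice of second
hash are mine, not from any particular source:

#define TABSIZE 1009                        /* prime */

unsigned long slot[TABSIZE];        /* 0 marks an empty slot; keys are nonzero */

/* Return the index holding key, or -1 if it's absent.  Assumes the table
 * never fills completely, so the probe always reaches an empty slot.
 */
int dh_find(unsigned long key)
{
    unsigned i = key % TABSIZE;
    unsigned step = 1 + key % (TABSIZE - 1);    /* second hash; never 0 */

    while (slot[i] != 0) {
        if (slot[i] == key)
            return (int)i;
        i = (i + step) % TABSIZE;
    }
    return -1;
}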

More unusual applications keep a traditional indexing structure (so you don't
fill up the space as much), but store a hash of each key rather than the key
itself, because the hash is more compact and quicker to compare.  Diff uses
this to compare lines: it hashes each line of text and compares the hashes
first, then double-checks with a full text compare when the hashes are equal.
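
A sketch of that trick; this is not diff's actual code, just the idea:

#include <string.h>

/* Hash one line of text; the exact hash function doesn't matter much. */
unsigned long line_hash(const char *line)
{
    unsigned long h = 0;
    while (*line)
        h = h * 33 + (unsigned char)*line++;
    return h;
}

/* Compare two lines whose hashes are already known.  Different hashes mean
 * different lines; only when the hashes match do we pay for the full text
 * comparison.
 */
int lines_equal(const char *a, unsigned long ha,
                const char *b, unsigned long hb)
{
    return ha == hb && strcmp(a, b) == 0;
}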

Another interesting variant was used in a version of spell(1), although I
don't know if it's the current one or not.  The dictionary of correctly
spelled words was hashed into a large hash space (27 bits?) and the result
stored as a bitmap.  No, not as a literal array of 2^27 bits (2^24 bytes, or
16 megabytes), but as the gaps between adjacent set bits, with some
higher-level indices.  That let you check whether bit #4392859 is set fairly
quickly.  This has the possibility of missing a misspelled word (it can hash
onto a bit set by a correctly spelled one), but that's very rare, and a
"stop list" of words it fails to catch was added to deal with exceptions
that were actually noticed in practice.
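
The membership test against such a gap-coded bitmap might look roughly like
this; the higher-level indices that make it fast are omitted, and the names
are mine:

/* gaps[0] is the position of the first set bit, and each later entry is
 * the distance from the previous set bit.  Scan forward, accumulating
 * positions, until we reach or pass the bit we're asking about.
 */
int bit_is_set(const unsigned long gaps[], int ngaps, unsigned long want)
{
    unsigned long pos = 0;
    int i;

    for (i = 0; i < ngaps; i++) {
        pos += gaps[i];
        if (pos == want)
            return 1;
        if (pos > want)
            return 0;
    }
    return 0;
}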

For hashing integers to integers, input % outputrange works fairly well if you
arrange for outputrange to be prime.  For strings, hash = fiddle(hash) + *p++
is popular; someone just posted fiddle(x) = x*31 and x*33.  (You range-reduce
mod 2^n for a suitable n eventually.)
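
Written out in C, that string hash is just a few lines; the power-of-two
table size (so the final reduction is a mask) is an assumption of the
example:

#define fiddle(x) ((x) * 31)            /* or (x) * 33 */

unsigned string_hash(const char *p, unsigned tabsize)  /* tabsize = 2^n */
{
    unsigned hash = 0;

    while (*p)
        hash = fiddle(hash) + (unsigned char)*p++;
    return hash & (tabsize - 1);        /* range reduce mod 2^n */
}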

One that was published in CACM recently was to set up a table with a random
permutation of 0..255 and loop over hash = table[hash^*p++];.  If you set up
the table more methodically, you get an 8-bit CRC of the input string, but
any random permutation works well.
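
That one is also short; perm[] is assumed to be filled with a random shuffle
of 0..255 at startup, and the names are mine:

unsigned char perm[256];        /* some random permutation of 0..255 */

unsigned char table_hash(const char *p)
{
    unsigned char hash = 0;

    while (*p)
        hash = perm[hash ^ (unsigned char)*p++];
    return hash;
}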

Knuth (section 6.4) mentions multiplicative hashing (take some of the middle
bits of constant*input).
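
A sketch of multiplicative hashing, assuming 32-bit arithmetic and a table of
2^n buckets; the constant is roughly 2^32 divided by the golden ratio, a
common choice, and the function name is mine:

/* Multiply by the constant and keep the top n of the low 32 bits of the
 * product, i.e. bits from the middle of the full 64-bit product.
 * Assumes 0 < n < 32.
 */
unsigned mult_hash(unsigned long key, int n)    /* table holds 2^n buckets */
{
    unsigned long product = (key * 2654435761UL) & 0xffffffffUL;

    return (unsigned)(product >> (32 - n));
}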

That was quick, but that's the guts of the thing.


