hash function

Martin Golding martin at adpplz.UUCP
Thu Jun 27 06:11:34 AEST 1991


In <15047 at ulysses.att.com> kpv at ulysses.att.com (Phong Vo[drew]) writes:

>For everyone's entertainment, I did a simple (and admittedly contrived) test
>of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
>This is done by treating each number which is stored in a word as a
>string of 4 bytes. With 261, 927300 of these numbers are
>in their own buckets, the rest are in 36350 buckets each with 2 elements.
>This is almost as good as a perfect hash function. The result for 33 is
>much worse. The largest bucket has 64 elements in it and there are 4050
>such buckets. The key difference is the 8th bit in 261 which causes the
>mixing to go a little faster.

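For anyone who wants to reproduce the quoted test, it boils down to
something like the following sketch. Phong didn't post code, so the
function name, signature, and loop shape here are my guesses:

/* Byte-at-a-time multiplicative hash over the 4 bytes of a word.
 * mult is 261 or 33, the two multipliers compared above; everything
 * else (names, bucket count as a parameter) is my reconstruction.
 */
unsigned long hash_word(unsigned long key, unsigned long mult,
                        unsigned long nbuckets)
{
    unsigned char *p = (unsigned char *) &key;
    unsigned long h = 0;
    int i;

    for (i = 0; i < 4; i++)         /* treat the word as 4 bytes */
        h = h * mult + p[i];
    return h % nbuckets;
}
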
Your test is probably irrelevant, since the keys are contiguous sequential
numbers. The best hash in this case is
for (hash = i = 0; numstr[i] != '\0'; i++)
    hash = hash * 10 + (numstr[i] - '0');
hash = hash % numberofbuckets;

This will yield the best possible distribution for any number of hash
buckets: the loop simply reconstructs the integer from its decimal digits,
and a block of contiguous integers taken mod B fills the buckets as evenly
as arithmetic allows, with no bucket holding more than one element more
than any other. The rest of the proof is left to the reader.
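
If you don't trust me, here is a quick empirical check (mine, not part of
Phong's test): hash the same million numbers, formatted as decimal strings,
and report the worst bucket. It should print 1.

#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1000000L    /* bucket count from Phong's test */

int main(void)
{
    long n, i, hash, worst = 0;
    char numstr[16];
    long *count = calloc(NBUCKETS, sizeof *count);

    if (count == 0)
        return 1;
    for (n = 1000000L; n < 2000000L; n++) {
        sprintf(numstr, "%ld", n);
        for (hash = i = 0; numstr[i] != '\0'; i++)
            hash = hash * 10 + (numstr[i] - '0');
        hash = hash % NBUCKETS;
        count[hash]++;
    }
    for (i = 0; i < NBUCKETS; i++)
        if (count[i] > worst)
            worst = count[i];
    printf("largest bucket: %ld\n", worst);    /* prints 1 */
    free(count);
    return 0;
}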

The point is not that there is a better hash for sequential numbers, but
that the performance of _any_ hash function is highly dependent on your keys;
if you have any idea of the values or kinds of values to expect, you can
choose a better function. If you can _control_ the values of your keys, you
can sometimes get a _much_ better function; viz. the numbers above.
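
As a made-up illustration of that last point: if your program issues the
keys itself, sequential record IDs make the trivial hash perfect until you
have more live keys than buckets:

/* Hypothetical example: when the program hands out keys itself,
 * contiguous IDs spread perfectly and collide only once there are
 * more keys than buckets.
 */
static unsigned long next_id = 0;

unsigned long new_key(void)
{
    return next_id++;
}

unsigned long id_hash(unsigned long key, unsigned long nbuckets)
{
    return key % nbuckets;    /* uniform for contiguous IDs */
}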

If all this is old and boring stuff, I apologise for interrupting you all
just to look like an idiot (flame me by email), and I now return you to
your regularly scheduled newsgroup.


Martin Golding    | sync, sync, sync, sank ... sunk:
Dod #0236         |  He who steals my code steals trash.
A poor old decrepit Pick programmer. Sympathize at:
{mcspdx,pdxgate}!adpplz!martin or martin at adpplz.uucp


