hash function

Dik T. Winter dik at cwi.nl
Wed Jun 26 23:25:56 AEST 1991


In article <17918.Jun2608.17.5091 at kramden.acf.nyu.edu> brnstnd at kramden.acf.nyu.edu (Dan Bernstein) writes:
 > In article <15047 at ulysses.att.com> kpv at ulysses.att.com (Phong Vo[drew]) writes:
 > > For everyone's entertainment, I did a simple (and admittedly contrived) test
 > > of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
 > > This is done by treating each number which is stored in a word as a
 > > string of 4 bytes.
 > 
 > Sorry, but the mod has to be a power of 2.
Why?  (If you are telling me for speed I'll scream.)
 > 
 > Surely you noticed that a multiplier of 256 will produce perfect hashing
 > in this case. Does that mean you're recommending a multiplier of 256?
Only on big-endian machines of course :-)
 > 
 > The profiling statistics I've saved from various compressors form a much
 > more realistic sample. They show 33 as slightly better than random
 > hashing on typical files, 37 as slightly worse. I've never tried 261 or
 > 31, but I'll bet 261 does worse than 33 on text files.
 > 
I did some simple tests, hashing all entries of /usr/dict/words and counting
how many entries fall in each bucket.  I did this with the four multipliers
Dan just mentioned, for 16 buckets up to 8192 buckets (powers of two only).
I also calculated the standard deviations involved.  I will not bore you
with all those digits; the result can be stated simply: it does not matter
very much which multiplier you use.  For each multiplier there is some
number of buckets where that multiplier has the smallest standard deviation.
So use whatever you like.
--
dik t. winter, cwi, amsterdam, nederland
dik at cwi.nl
