hash function
Dik T. Winter
dik at cwi.nl
Wed Jun 26 23:25:56 AEST 1991
In article <17918.Jun2608.17.5091 at kramden.acf.nyu.edu> brnstnd at kramden.acf.nyu.edu (Dan Bernstein) writes:
> In article <15047 at ulysses.att.com> kpv at ulysses.att.com (Phong Vo[drew]) writes:
> > For everyone's entertainment, I did a simple (and admittedly contrived) test
> > of hashing 1000000 numbers starting at 1000000 into 1000000 buckets.
> > This is done by treating each number which is stored in a word as a
> > string of 4 bytes.
>
> Sorry, but the mod has to be a power of 2.
Why? (If you are telling me for speed I'll scream.)
>
> Surely you noticed that a multiplier of 256 will produce perfect hashing
> in this case. Does that mean you're recommending a multiplier of 256?
Only on big-endian machines of course :-)
>
> The profiling statistics I've saved from various compressors form a much
> more realistic sample. They show 33 as slightly better than random
> hashing on typical files, 37 as slightly worse. I've never tried 261 or
> 31, but I'll bet 261 does worse than 33 on text files.
>
I did some simple tests, hashing all entries of /usr/dict/words and counting
how many entries fell into each bucket. I did this with the four multipliers
Dan just mentioned, for 16 buckets up to 8192 buckets (powers of two only).
I also calculated the standard deviations involved. I will not bore you with
all those digits; the result can be stated simply: it does not matter very
much which multiplier you use. For each multiplier there is some number of
buckets at which that multiplier has the smallest standard deviation. So use
whatever you like.
--
dik t. winter, cwi, amsterdam, nederland
dik at cwi.nl