hash function
Chris Torek
torek at elf.ee.lbl.gov
Mon Jun 24 19:15:58 AEST 1991
In an article whose referent is long gone, I described this hash function:
	for (each character)
		hash = (hash * 33) + (the character);
	hash_index = hash & ((some power of two) - 1);
and noted:
>>Maybe there are some theory people out there who can explain [why 33
>>works well].
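(For concreteness, here is that fragment as a complete C function; the
function name, the zero initial value, and the table size are my own
choices, not part of the original fragment:)

```c
/* The `times 33' hash above as a complete function.  NBUCKETS is
 * `some power of two', so the AND below is a cheap modulus. */
#define NBUCKETS 1024u

unsigned hash33(const char *s)
{
	unsigned hash = 0;

	while (*s != '\0')
		hash = hash * 33 + (unsigned char)*s++;
	return hash & (NBUCKETS - 1);	/* the hash_index */
}
```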
In article <15027 at ulysses.att.com> kpv at ulysses.att.com (Phong Vo[drew]) writes:
>A good property for a hash function is if you keep feeding it
>a constant character, it performs as if it's a random number generator.
>The hash function f(h,c) = 33*h + c is derived from a class of random
>number generators called linear congruential. At the start of Knuth Vol 2,
>there is a good discussion of these generators.
... and indeed, if you turn to Knuth vol. 2, chapter 3, you will find all
sorts of interesting stuff, such as a proof that the low-order n bits of
the usual rand() repeat with period $2^n$. (rand() uses a modulus of
either $2^{15}$ or $2^{31}$.)
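(That period claim is easy to check directly. A throwaway sketch of my
own, using the textbook constants a = 1103515245, c = 12345, modulus
$2^{31}$ that appear in many C library rand() implementations:)

```c
/* Return the period of the low `bits' bits of the LCG
 *	x -> (1103515245*x + 12345) mod 2^31
 * starting from seed 1.  Since the low bits of x_{n+1} depend only
 * on the low bits of x_n, the first return to the starting value
 * is exactly the period of the low-bit sequence. */
unsigned long low_bit_period(unsigned bits)
{
	const unsigned long a = 1103515245ul, c = 12345ul;
	const unsigned long mmask = 0x7ffffffful;	/* mod 2^31 */
	unsigned long mask = (1ul << bits) - 1;
	unsigned long x = 1ul, start = x & mask;
	unsigned long n = 0;

	do {
		x = (a * x + c) & mmask;
		n++;
	} while ((x & mask) != start);
	return n;
}
```

Per Knuth's result, low_bit_period(n) comes out to exactly $2^n$.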
>... Assuming that c is some fixed odd value, Theorem A guarantees that
>33 is good because (33-1)%4 == 0. ... Here good means that if you keep
>feeding c, you'll cycle thru the entire set of values [0,2^x).
This is approaching what I was getting at, but is not quite there.
Basically it says that 33 is `good' because it is `not bad'. An LCRNG
is defined as
	$h_{n+1} = (a h_n + c) \bmod m$
(where $h_n$ is each successive value of the variable `hash' in the
for loop)
(in quote `>' above, $m = 2^x$). With a `bad' multiplier for $a$---and
what is `bad' depends on $c$ and $m$---the generator will not produce all
$m$ possible values; with a `good' one it will. Intuitively, then, the
`bad' one will miss some of the hash buckets entirely; the `good' one
will not.
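(That `good'/`bad' distinction can be checked exhaustively for a small
modulus. A sketch of my own; the function name and the test choices
below, x = 12 and c = 7, are arbitrary:)

```c
#include <stdlib.h>

/* Return 1 if h -> (mult*h + c) mod 2^x, started at h = 0, visits
 * all 2^x values---i.e., `mult' is a `good' multiplier for this
 * fixed odd c---or 0 if it cycles early and misses some buckets. */
int visits_all(unsigned mult, unsigned c, unsigned x)
{
	unsigned m = 1u << x;
	unsigned h = 0, i;
	unsigned char *seen = calloc(m, 1);
	int ok = 1;

	for (i = 0; i < m; i++) {
		if (seen[h]) {		/* cycled early */
			ok = 0;
			break;
		}
		seen[h] = 1;
		h = (mult * h + c) & (m - 1);
	}
	free(seen);
	return ok;
}
```

By Theorem A, 33 passes for any odd c, since (33-1)%4 == 0, while a
multiplier like 34 does not.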
But *why* is it true that `A good property is [that] it performs as if
it's a [good] random number generator'? This is now `obvious' (from the
above result) *if* we feed in constant values of $c$, i.e., if all the
strings are made of the same characters, but are different lengths. But
we rarely do that---the strings we hash are full of weird junk. How do
LCRNGs get in there? Why do the varying values of $c$ not mess things
up? (Of course, they do, but not enough to matter. Why not?)
>At this point, we depart from theory. We typically hash byte strings ...
Actually, I typically hash ASCII text. 33 is the `easiest' `good'
multiplier (as defined above) that is >= 26. I have no idea whether
this is significant, or merely coincidental.
>... This suggests that some value larger than 2^8 (but not too large)
>may do better as a multiplicator. ...
> hash = 0;
> for(each byte c)
> hash = (hash<<8) + (hash<<2) + hash + c;
I will see if I can get this one run on the same data as the `33 hash'.
This is more work than
	hash = (hash << 5) + hash + c;
and distribution improvements may be outweighed by additional hash time
(or may not: who knows?).
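(For what it's worth, both shift forms are plain multiplications in
disguise, so the distribution question is just `is 261 a better
multiplier than 33?'. A sketch; the function names are mine:)

```c
/* One step of each proposed hash, written exactly as the shifts
 * quoted above. */
unsigned step261(unsigned hash, unsigned char c)
{
	/* (h<<8) + (h<<2) + h == h*(256 + 4 + 1) == h*261 */
	return (hash << 8) + (hash << 2) + hash + c;
}

unsigned step33(unsigned hash, unsigned char c)
{
	/* (h<<5) + h == h*(32 + 1) == h*33 */
	return (hash << 5) + hash + c;
}
```

Note that 261 passes the same Theorem A test as 33, since
(261-1)%4 == 0.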
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA Domain: torek at ee.lbl.gov