hash function

Chris Torek torek at elf.ee.lbl.gov
Mon Jun 24 19:15:58 AEST 1991


In an article whose referent is long gone, I described this hash function:
	for (each character)
		hash = (hash * 33) + (the character);
	hash_index = hash & ((some power of two) - 1);

and noted:
>>Maybe there are some theory people out there who can explain [why 33
>>works well].
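
Spelled out in C, that loop is just the following (the function name
and the power-of-two table size are placeholders here, not anything
from a real program):

	unsigned int
	hash33(const char *s, unsigned int tabsize)
	{
		unsigned int hash = 0;

		while (*s != '\0')		/* for (each character) */
			hash = hash * 33 + (unsigned char)*s++;
		return hash & (tabsize - 1);	/* tabsize is a power of two */
	}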

In article <15027 at ulysses.att.com> kpv at ulysses.att.com (Phong Vo[drew]) writes:
>A good property for a hash function is if you keep feeding it
>a constant character, it performs as if it's a random number generator.
>The hash function f(h,c) = 33*h + c is derived from a class of random
>number generators called linear congruential. At the start of Knuth Vol 2,
>there is a good discussion of these generators.

... and indeed, if you turn to Knuth vol. 2, chapter 3, you will find all
sorts of interesting stuff, such as a proof that the low-order n bits of
the usual rand() repeat with period $2^n$.  (rand() uses a modulus of
either $2^{15}$ or $2^{31}$.)
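
If you want to watch that happen, here is a throwaway program using the
textbook multiplier 1103515245 and increment 12345 (which may or may not
be what any particular vendor's rand() actually uses); the low-order
three bits repeat every 8 steps:

	#include <stdio.h>

	int
	main(void)
	{
		unsigned long state = 1;
		int i;

		for (i = 0; i < 16; i++) {
			/* the LCRNG step, modulus 2^31 */
			state = (state * 1103515245UL + 12345UL) & 0x7fffffffUL;
			printf("%lu ", state & 7);	/* low-order 3 bits */
		}
		printf("\n");
		return 0;
	}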

>... Assuming that c is some fixed odd value, Theorem A guarantees that
>33 is good because (33-1)%4 == 0. ... Here good means that if you keep
>feeding c, you'll cycle thru the entire set of values [0,2^x).

This is approaching what I was getting at, but is not quite there.
Basically it says that 33 is `good' because it is `not bad'.  An LCRNG
is defined as

	$h_{n+1} = (a h_n + c) \bmod m$

(where $h_n$ is each successive value of the variable `hash' in the
for loop)

(in quote `>' above, $m = 2^x$).  With a `bad' multiplier for $a$---and
what is `bad' depends on $c$ and $m$---the generator will not produce all
$m$ possible values; with a `good' one it will.  Intuitively, then, the
`bad' one will miss some of the hash buckets entirely; the `good' one
will not.
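
To see the difference, feed a single constant `character' through the
recurrence and count how many buckets get touched.  With a 16-entry
table, $a = 33$ visits all 16; $a = 3$, for which (a-1)%4 != 0, visits
only half of them.  The table size and the constant c = 7 below are
arbitrary choices for the sake of the demonstration:

	#include <stdio.h>

	#define M 16			/* table size, a power of two */

	static unsigned
	buckets_hit(unsigned a, unsigned c)
	{
		unsigned char seen[M] = { 0 };
		unsigned h = 0, n, count = 0;

		for (n = 0; n < M; n++) {
			h = (a * h + c) % M;	/* the LCRNG step */
			if (!seen[h]) {
				seen[h] = 1;
				count++;
			}
		}
		return count;
	}

	int
	main(void)
	{
		printf("a = 33: %u of %d buckets\n", buckets_hit(33, 7), M);
		printf("a =  3: %u of %d buckets\n", buckets_hit(3, 7), M);
		return 0;
	}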

But *why* is it true that `A good property is [that] it performs as if
it's a [good] random number generator'?  This is now `obvious' (from the
above result) *if* we feed in constant values of $c$, i.e., if every
string consists of the same single character repeated, with only the
lengths differing.  But
we rarely do that---the strings we hash are full of weird junk.  How do
LCRNGs get in there?  Why do the varying values of $c$ not mess things
up?  (Of course, they do, but not enough to matter.  Why not?)

>At this point, we depart from theory. We typically hash byte strings ...

Actually, I typically hash ASCII text.  33 is the `easiest' `good'
multiplier (as defined above) that is >= 26.  I have no idea whether
this is significant, or merely coincidental.

>... This suggests that some value larger than 2^8 (but not too large)
>may do better as a multiplicator. ...
>	hash = 0;
>	for(each byte c)
>		hash = (hash<<8) + (hash<<2) + hash + c;

I will see if I can get this one run on the same data as the `33 hash'.
This is more work than

		hash = (hash << 5) + hash + c;

and distribution improvements may be outweighed by additional hash time
(or may not: who knows?).
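
For the record, here is that `261 hash' written out in full, the same
way as the `33 hash' (again, the function name and the 1024-bucket
table size are placeholders, chosen only for illustration):

	unsigned int
	hash261(const char *s)
	{
		unsigned int hash = 0;

		while (*s != '\0') {
			/* hash = 261*hash + c */
			hash = (hash << 8) + (hash << 2) + hash
			    + (unsigned char)*s++;
		}
		return hash & 1023;		/* 1024 buckets */
	}
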
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek at ee.lbl.gov


