hash algorithm

Thu Aug 18 12:18:47 AEST 1988

>From article <2550078 at hpisod2.HP.COM>, by decot at hpisod2.HP.COM (Dave Decot):
>> could anyone recommend a hashing algorithm to store a list of 10
>> digit phone numbers (area code plus 7 digits).
>> It should have little or no overflow and no collisions (as few as
>> possible). The list will contain more than one area code but
>> about 1000 numbers per area code (ie not a totasly random sample
>> but one which has a relatively common prefix)
>> thank you very much.
> 
> Treat the ten digits as an integer, n, and compute the index
> as (n % HASH_SIZE), where HASH_SIZE is the size of your hash table.
> 
> Store the number, and the record for that number, in the first empty
> hash table location at or after the index you've computed.

...

There are a couple of rather serious problems with the algorithm suggested.
One is that the high order bits in the phone numbers will not affect
the hash.  Say for example you store all of radio station KGO's talk 
numbers, (415) 263-TALK, (408) 469-TALK, etc. (or whatever they are).
They all map to the same slot.

Secondly, the "first-empty-slot-after" algorithm tends to build up clumps
which the subsequent linear searches have to trace through, adding to
the size of the clump, and thus the likelyhood that it will get hit
next time.

I have devised a method for hashing which has not been published previously.
It is very very fast. Here is the WORLD PREMIERE!!

First let's settle the business of how to hash integers into the
range 0..M-1.  I suggest that you use hash(I) = (I*32821) % M

The number 32821 has some magic properties that spatter things about nicely.
See "Algorithms" by Sedgewick.

My hashing technique uses a table-size M which is always a power of two.
This makes the modulo-function (%) very fast on a two's compliment machine,
if you are willing to accept a tiny bit of machine-dependance. (I never
expect to use a machine that is not two's compliment, so I'm not too 
scared.)

You code the mod function as

	H & mask

rather than 

        H % table_size

where mask is table_size-1.

For a "rehash" function, rather that looking at the next slot, I use the
function defined by the following recursive definition:

         rehash(0,slot) = slot
         rehash(i+1, slot) = ((rehash(i,slot)+1)*3) mod M

When M is a power of two, this fuction cycles after precisely M/2 steps.
I have proved as much, but my proof is tedious. The range of the
sequence is the set of numbers in 0..M-1 which are congruent to
slot or slot+3 modulo 4.  Don't ask me how I discovered this; I think
it came to me in a vision.

When the table gets half full, I double its size and hash everything
into the new table.

The routines allow for removing entries, as well as adding them
and looking them up, but I have omited the removal routine, which is
quite tricky.

Here is an excerpt from the actual code: (For purposes of the telephone
number problem, we may assume that "entry" is a record which contains
a telephone number and some other stuff, and that obj->eq is a pointer
to a function that returns true when two entries have the same tele-number
in them and false otherwise.  obj->hash is a pointer to a function that 
multiplies the tele number in an entry by 32821.)

The following provably terminates because the rehash sequence hits
exactly half the slots, and the table is kept less than half full.

/* CAVEAT: The following assumes that integers are two's complement. */
#define HASH(cont) (((*(obj->hash))(cont)) & obj->mask )
#define REHASH(num)  (((((num)+1)*3) & obj->mask) )

Ptr
Hash_put(obj, entry )
  register Hash* obj;
  Ptr entry;
{

  register int bucket_number;
  register Ptr* bucket;

  bucket_number = HASH(entry);

  while(1)
    {
      bucket = obj->hash_table + bucket_number;

      if ( *bucket == (Ptr)0 )
	{ 
	  *bucket = entry;
	  obj->num_entries++;
	  if ( obj->num_entries > obj->max_entries )
	    Hash_overflow(obj);  /* double size of table */
	  return (Ptr)0;  /* <======== added new entry */
	}

      if ( !obj->eq( entry, *bucket ) )
	{ 
	  bucket_number = REHASH(bucket_number);
	  continue; /* <====== search some more (collision) */
	}

      /* Found old Ptr. Replace. */
      { 
	Ptr old = *bucket;
	*bucket = entry;
	return old; /* <============== replaced old entry */
      }

    }

}

Ptr
Hash_get(obj, entry )
  register Hash* obj;
  Ptr entry;
{

  register int bucket_number;
  register Ptr* bucket;

  bucket_number = HASH(entry);

  while(1)
    {
      bucket = obj->hash_table + bucket_number;

      if ( *bucket == (Ptr)0 )
	{ 
	  return (Ptr)0; /* <====== entry not found */
	}

      if ( !obj->eq( entry, *bucket) )
	{ 
	  bucket_number = REHASH(bucket_number);
	  continue; /* <====== search some more (collision) */
	}

        return *bucket; /* <====== found entry */
    }

}