Perfect HASH functions.....

Dave Jones djones at megatest.UUCP
Wed Sep 27 09:11:27 AEST 1989


From article <1989Sep24.214153.8867 at rpi.edu>, by flynn at clotho.acm.rpi.edu (Kevin Lincoln Flynn):
> As long as people are talking hashing, can anyone think of a GOOD way to hash
> 16-bit numbers determined dynamically at runtime into a fast way to look them
> up?  Specifically, I'm talking about the attribute dictionary of an object in
> an OO system we're working on here -- the tags of attributes are 16-bit handles
> about which nothing is known at runtime.  Someone suggested closed hashing,
> with the number of buckets growing by roughly doubling primes....  ie start
> with two buckets, then 3, then 7, then 17, etc.....  any ideas out there?
> (oh yeah, the hash function as well as the strategy'd be nice.  [ :) ])
> 


Reluctant as I am to contribute to your delinquency by aiding and abetting
dynamic attribute binding, well okay. :-|

For a hash-function, it depends on the form of those 16-bit numbers.
If all the bits are more or less equally important, you may just want
to multiply by a peculiar number. Try 9821. Read Sedgewick's "Algorithms"
or something by Knuth on pseudo-random number generation. Do some experiments
to see what kind of collision rates you get with different numbers and
different table-sizes. Numbers that end in x21, where x is even, may avoid
some problems. I don't pretend to understand this.

Now then. Concerning the strategy....

I have never been convinced that it is advantageous to use prime numbers
for sizing hash tables. Nobody has ever been able to do much more by way
of convincing than to wave books at me. But when you read those books, they
just say it's a Good Thing. No analysis of why it is a Good Thing. My
reasoning is that if the hash-function is good, then the algebraic properties
of the size of the table don't matter. Random squared is random. Or in this
case, pseudo-random squared is pseudo-random.

Why am I telling you all this?

Well, even on some modern machines, dividing is expensive. I just ran a
little test on my Sun 3/60 and it did 10,000,000 bitwise "&" operations
five and a half times as fast as it did the same number of divide
operations.

I have one hash-table package that I use almost exclusively
and it seems to give very good results. It keeps tables that are
powers of two. Therefore to mod out the size of the table, it masks the
hash-value with one & operation. I keep the table strictly less than
half full, in order to keep the number of collisions low.

The less-than-half-full property allows me to use a rather peculiar
rehashing scheme that I dreamed up. It is based on a very remarkable
property that I discovered, I don't know how. Too much coffee, probably.
The property is that if M is a power of two, then the sequence

               S(0) = 0
               S(n+1) = ((S(n)+1)*3) % M

repeats after exactly M/2 steps. Since the "% M" can be done with
an & operation, and the "*3" can be done with a shift and add, this
is a fast rehash to calculate. And the algorithm does not have to check to
see if it has cycled, because by keeping the table less than half full, we
have guaranteed that this sequence will have an empty slot in it.

I've got a routine that puts something in the table if nothing else
equivalent was there, or returns the equivalent thing. And a routine
that replaces the equivalent thing and returns the old value, a routine
that just looks for something, etc., etc...  I even have a routine that
removes entries. The average time (which should be the usual case, if the hash-
function is good) is O(1) for both insertion and removal. Worst case for
insertion is O(n) and for removal is O(n**2), but that doesn't matter,
of course.  Here's a sample. (The way the bit-mask for modding out the
power of two is calculated assumes a two's complement machine. I don't
ever expect to use the other kind.)

And oh yes. Please forgive the dynamic binding of the "hash" and "eq"
attributes. In the OO attribute looker-upper you will no doubt
want to do the right thing and expand the hash and equivalence functions
in line.


#define HASH(cont) (((*(obj->hash))(cont)) & obj->mask )
#define REHASH(num)  (((((num)+1)*3) & obj->mask) )

static Ptr* new_table();
static int round_up_to_pwr2();
static void Hash_overflow();

Hash*
Hash_init_sized(obj, hash, eq, initial_table_size)
     register Hash* obj;
     int (*hash)();
     Bool (*eq)();
     int initial_table_size;
{
  if(initial_table_size < 4) initial_table_size = 4;
  obj->size = round_up_to_pwr2(initial_table_size);
  obj->num_entries = 0;
  obj->max_entries = obj->size/2 - 1;
  obj->mask = obj->size - 1;
  obj->hash_table = new_table(obj->size);
  obj->hash = hash;
  obj->eq = eq;
  return obj;
}

/* Put a new entry into the table. If there was previously an
 * equivalent entry in the table, return it.
 * If there was not, return (Ptr)0.
 */


Ptr
Hash_put(obj, entry )
  register Hash* obj;
  Ptr entry;
{

  register int bucket_number;
  register Ptr* bucket;

  bucket_number = HASH(entry);

  while(1)
    {
      bucket = obj->hash_table + bucket_number;

      if ( *bucket == (Ptr)0 )
	{ 
	  *bucket = entry;
	  obj->num_entries++;
	  if ( obj->num_entries > obj->max_entries )
	    Hash_overflow(obj);
	  return (Ptr)0;  /* <======== added new entry */
	}
      
      if ( !(*obj->eq)( entry, *bucket ) )
	{ 
	  bucket_number = REHASH(bucket_number);
	  continue; /* <====== search some more (collision) */
	}

      /* Found old Ptr. Replace. */
      { 
	Ptr old = *bucket;
	*bucket = entry;
	return old; /* <============== replaced entry */
      }

    }
  
}



More information about the Comp.lang.c mailing list