'HASH'ing techniques.

Wonderly gregg at ihlpb.ATT.COM
Fri Dec 30 03:46:18 AEST 1988


>From article <1687NU013809 at NDSUVM1>, by NU013809 at NDSUVM1.BITNET (Greg Wettstein):
>In following discussions in comp.lang.c I have frequently seen discussions
>involving 'hashing' techniques and algorithms.  These discussions were
>particularly prevalent when the shift ( << >> ) operators were being discussed.
> 
>While I have done a fair amount of programming in C I have never utilized
>'hashing' algorithms in any of my applications up until now.  I was given a
>piece of application code just before the holiday season which needs some work
>to smooth out some bugs which user's have been reporting.  One module of the
>code is heavily dependent on a function which purports to 'HASH' a series of
>input strings.
> 
>As near as I can tell from the code the 'hashing' function seems to generate
>a unique numerical value for each string.  Is this the intended purpose of
>hashing or have I misinterpreted the code?  The code seems to be dependent on
>shifting an input character and on a table of prime numbers.

The purpose of hashing is to provide a mapping from a very large in
range, but sparsely populated set to a smaller set in such a way as to
make it both efficient and practical to deal with the sparse data.

Strings happen to be a good example (if not the classic) of a sparse
set.  The classic method of hashing is to 'fold' the string into an
integer by using each character's numeric representation, ASCII, EBCDIC
or other, as a factor of an equation.  Typically, the equation is
recursive in that one of the factors is the previous value of the equation.
Put more simply, and in C

	value = 0;

	while (*s)
		value += *s++;

This fragment merely adds the characters together, yielding an integer
value which is unique (that's the important part) for a particular
string pointed to by 's'.  This equation is rather simple, and can yield
poorly distributed mappings.  However, in the general case it is enough
to do a respectable job.  You can use shifting and bit masking to make
sure that degenerate cases, those where the strings contain the same
characters, still yield varying values.  e.g.  the above method yields
the same value for all of the strings,

	string tsring gnirts gnitsr

and any other string with exactly those characters.

The use of this integer value is the next problem.  There are several
things that this integer can represent.  The most classic is an array
index, in which case the modulo operatation is applied to the value using
the size of the array.  The result is then an index into an array, which
is within the bounds of the array.

The problem with hashing is that the mapping from set to set is onto,
but not one-to-one.  Therefore, more than one element from the original
set may map onto the same element (array index in this example) of the
resulting set (called a collision).

There are many different was to handle collisions.  If you know how
many items will be in the original set, you can make you result array
just big enough to hold all the items.  When a collision occurs, you
apply the equation

	value' = (value + n) % m

to value to get value' which is a different, but well defined, value,
until you find an empty slot.  'n' is chosen to be prime to 'm' (the
array size) so that all elements of the array are examined before there is
any repetition.

It turns out that this is fairly good in terms of performance until the
array gets to be about 60-70% full.  At this point, the collision resolution
starts costing a lot.

The better if not best method is to make the array be an array of
pointers (this method is called chaining) which make up a linear linked
list.  Whether or not a collision occurs, the item is inserted at the
beginning (the list may also be ordered) of the list.

The big plus here is that collisions no longer cause degenerate behavior
for all items inserted in the future.  Only the items on a particular
list will suffer performance hits.

There is a lot more to hashing including how it can be used on secondary
storage to guarantee one and only one disk access.  If you really want
a lot more information, see the textbooks...


-- 
It isn't the DREAM that NASA's missing...  DOMAIN: gregg at ihlpb.att.com
It's a direction!                          UUCP:   att!ihlpb!gregg



More information about the Comp.lang.c mailing list