Efficient STRing CoMPares?

R. Kym Horsell kym at bingvaxu.cc.binghamton.edu
Thu Mar 21 07:31:33 AEST 1991


In article <1991Mar20.174452.4246 at zoo.toronto.edu> henry at zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Mar19.034802.24340 at cbnewsk.att.com> barton at cbnewsk.att.com (jim.m.barton) writes:
>>... The above macro, StrEq(), MIGHT be faster if most
>>of the strings being compared tend to differ in their first character...
>>IMHO, I doubt that StrEq() will achieve any significant optimization for
>>MOST cases...
>Sorry, there is considerable experimental evidence that it does.  The fact
>is, most string comparisons fail on the very first character, and even a

An easy way to verify that two strings typically differ
in their first character is to run the following (dumb) sorting program.
For the first 10k words out of /usr/dict/words, scrambled into
essentially random order, it reports that 89% of the comparisons failed
on the first character.

This is not much of a test, granted, and there _will_
be some cases (e.g. sorting nearly-sorted lists of words) where
comparisons _won't_ be decided by the first characters. However, even in
those cases the overhead of one extra `compare' is small relative to the
overhead of the strcmp loop -- please feel free to
verify this experimentally.

(As a back-of-the-envelope calculation:
if the strcmp code costs S instructions and the extra compare costs
C instructions, then on average we perform
C + .11 S instructions -- every comparison pays for the compare, but only
the 11% that match on the first character fall through to strcmp. We can
then ask ``how large would C have to be before plain strcmp wins?''
Setting C + .11 S = S gives break-even at C = .89 S. Therefore,
provided the compare sequence (including jumps and register loads) on your
machine costs less than 89% of a (typical) call/inline of strcmp, USE THE
COMPARE).
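For the record, the StrEq() macro itself isn't quoted above; a common
form of the trick (my assumption as to its shape, not necessarily Jim's
exact macro) is:

```c
#include <string.h>

/* Short-circuit string equality: compare the first characters inline,
   and only pay for the strcmp() call when they match.  Note the macro
   evaluates its arguments more than once, so don't pass expressions
   with side effects (e.g. StrEq(p++, q)). */
#define StrEq(a, b) (*(a) == *(b) && strcmp((a), (b)) == 0)
```

The && guarantees strcmp() runs only for the ~11% of pairs that agree
on the first character.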

As a slight side note, there are certain contexts where string comparisons
can be made MUCH more efficient than a character-by-character compare.

One example is the well-known family of `string search' algorithms that
essentially compare _rare_ characters first (e.g. if the pattern contains
a `z', first search for `z' in the text -- if found, then match the
rest) and then adjust pointers by an appropriate amount (rather than just
incrementing them) depending on the outcome of each character comparison.
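As a sketch of the skip-ahead idea, here is Boyer-Moore-Horspool -- one
member of that family, and not necessarily the exact algorithm from the
article mentioned below:

```c
#include <string.h>

/* Boyer-Moore-Horspool substring search.  Precompute, for each byte
   value, how far the pattern may safely shift when that byte is the
   last byte of a failed window; on a mismatch we skip ahead by that
   amount instead of incrementing by one.  Returns the offset of the
   first match, or -1 if there is none. */
long bmh_search(const char *text, const char *pat)
{
	size_t n = strlen(text), m = strlen(pat);
	size_t skip[256], i, j;

	if (m == 0) return 0;
	if (m > n) return -1;

	for (i = 0; i < 256; i++)
		skip[i] = m;            /* byte not in pattern: shift whole length */
	for (i = 0; i + 1 < m; i++)
		skip[(unsigned char)pat[i]] = m - 1 - i;

	i = 0;
	while (i + m <= n) {
		/* compare the window right-to-left */
		for (j = m; j > 0 && text[i + j - 1] == pat[j - 1]; j--)
			;
		if (j == 0)
			return (long)i; /* whole window matched */
		i += skip[(unsigned char)text[i + m - 1]];
	}
	return -1;
}
```

For a pattern ending in a rare character the typical shift is close to
the full pattern length, which is where the large speedups come from.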

I remember reading an article in Software--Practice and Experience
that compared such a statistical method with hardware
string-matching INSTRUCTIONS (I _think_ it was on an IBM 370, so that dates
me, doesn't it?), and the algorithm was found to be SUBSTANTIALLY faster
(like an order of magnitude, unless I'm mistaken) than the hardware
string compare.

-kym

/* Dumb sort program that figures out how many strings (mis)match on the
first character -- it says 11% match for /usr/dict/words. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXWORD 10000

char *word[MAXWORD];
int nw = 0;
int nmatch1 = 0;
int ncmp = 0;

void init(void);
void scramble(void);
void sort(void);
int cmp(const char *s1, const char *s2);

int main(void)
{
	init();
	scramble();
	sort();
	printf("fraction match first char=%g\n", (double)nmatch1/ncmp);
	return 0;
}

/* Read up to MAXWORD words, one per line, from stdin. */
void init(void)
{
	char buf[100];
	while (nw < MAXWORD && fgets(buf, sizeof buf, stdin) != NULL) {
		buf[strcspn(buf, "\n")] = '\0';	/* strip the newline */
		word[nw] = strcpy(malloc(strlen(buf)+1), buf);
		++nw;
	}
}

/* Scramble the words into essentially random order. */
void scramble(void)
{
	int i;
	for (i = 0; i < nw; i++) {
		int j = rand() % nw;
		char *tmp = word[i];
		word[i] = word[j];
		word[j] = tmp;
	}
}

/* Dumb selection sort, counting comparisons via cmp(). */
void sort(void)
{
	int i, j, k;
	char *tmp;
	for (i = 0; i < nw; i++) {
		j = i;
		tmp = word[j];
		for (k = j+1; k < nw; k++)
			if (cmp(word[k], tmp) < 0)
				j = k, tmp = word[j];
		word[j] = word[i];
		word[i] = tmp;
	}
}

/* Count every comparison, and how many match on the first character. */
int cmp(const char *s1, const char *s2)
{
	++ncmp;
	if (*s1 == *s2)
		++nmatch1;
	return strcmp(s1, s2);
}
