strcmp

Fri Jun 21 22:04:36 AEST 1991

In article <677424380 at romeo.cs.duke.edu>, drh at duke.cs.duke.edu (D. Richard Hipp) writes:
> I have, on various occasions, implemented my own string comparison
> routines which attempt to address the above deficiencies in strcmp.
> (One such implementation, strpbcmp -- string compare in PhoneBook order,
> is attached.)

The routine posted does NOT compare strings in Phone Book order.
Here are the rules from a Phone Book:
    Names are divided into two parts for sorting.
    The first part, or the first word, determines the place to
    find the name.  The second part, all the initials or
    remaining words (including locality and telephone number)
    determine the order within that group.

    Business names which begin with "the" are generally sorted
    under the next word.

    Punctuation and special characters within a name will generally
    not alter their alphabetical position and should be ignored.

    When initials precede a name, they will be treated as the first
    name, regardless of punctuation.

    If the name contains a number, the numeric character will be
    sorted as though it were a word (i.e. 1 = one).  In some cases,
    names which commence with numerals will be found under the name
    as it is pronounced.

    A prefix is included as part of the first word even if it is
    separated from the second part of the name by a hyphen.
    (This one is _really_ fun.  You have to know that "Le Blanc"
    has a prefix "Le" while "Le Tseung" probably hasn't, so that
    the latter name precedes the first.)

    Names which contain a hyphen are treated as two words and are
    sorted according to the first name.  This does not apply to
    hyphenated names which begin with a prefix.

    "Mc" is treated as though spelt "Mac".  Names such as "Mace"
    and "Mack" are sorted with those names which commence with
    "Mc" and "Mac".
    "Mt" is treated as though spelt "Mount".  Names such as "Mount"
    appear first, followed by names which have "Mt" or "Mount" as
    the first part of their name.
    Names beginning with "St" are treated as though beginning
    with "Saint" (same rules as Mt/Mount).

This isn't really adequate; McDonald may also be spelled M'Donald,
and "St" is sometimes abbreviated to S, so "S. Adam Parish School"
should be sorted with "Saint-Adam", but isn't.
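
To make the easy rules concrete, here is a minimal sketch of a
normalising function.  The name pb_normalise and its interface are
my own invention for illustration, not anything from the posted
strpbcmp.  It handles only the mechanical rules (skip a leading
"the", spell "Mc" as "Mac", ignore punctuation, treat a hyphen as a
word break, fold case); numbers-as-words, Mt/St and the prefix rule
need a dictionary and are left out.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Store c if there is room, but always count it (as strxfrm does). */
static void put(char *dst, size_t size, size_t *len, char c)
{
    if (*len + 1 < size)
        dst[*len] = c;
    ++*len;
}

static int lower(char c) { return tolower((unsigned char)c); }

/*
 * Hypothetical normaliser: writes a sort key for src into dst (at most
 * size bytes including the NUL) and returns the length of the full key.
 */
size_t pb_normalise(char *dst, size_t size, const char *src)
{
    size_t len = 0;
    int pending_space = 0;

    while (isspace((unsigned char)*src))
        src++;

    /* A leading "the" is skipped. */
    if (lower(src[0]) == 't' && lower(src[1]) == 'h' &&
        lower(src[2]) == 'e' && isspace((unsigned char)src[3])) {
        src += 3;
        while (isspace((unsigned char)*src))
            src++;
    }

    /* "Mc" is treated as though spelt "Mac". */
    if (lower(src[0]) == 'm' && lower(src[1]) == 'c' &&
        isalpha((unsigned char)src[2])) {
        put(dst, size, &len, 'm');
        put(dst, size, &len, 'a');
        put(dst, size, &len, 'c');
        src += 2;
    }

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;

        if (isalnum(c)) {
            if (pending_space && len > 0)
                put(dst, size, &len, ' ');
            pending_space = 0;
            put(dst, size, &len, (char)tolower(c));
        } else if (isspace(c) || c == '-') {
            pending_space = 1;      /* word break; runs collapse to one */
        }
        /* anything else is punctuation and is ignored */
    }

    if (size > 0)
        dst[len < size ? len : size - 1] = '\0';
    return len;
}
```

With this sketch "McDonald" normalises to "macdonald", which strcmp()
places before "mace", so the Mc/Mac names end up interleaved as the
rule asks.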

It is worth noting that 'phone book order is not the same as dictionary
order.  There really wasn't any one order that C could have used.

> I therefore request opinion from the net on what others think is the
> one right, true, and proper way to compare strings.

There isn't any.  You might like to imitate the approach in ANSI C.
There are two functions which give you access to the local collating
method (see setlocale() / LC_COLLATE).  There is a function strxfrm():
	strxfrm(dest, source, n)
writes at most n characters of a ``normalised'' copy of source into
dest, and returns the length of the full copy (so a result of n or
more means dest was too small).  Comparing two normalised copies
using strcmp() then does the right thing.
	strcoll(s1, s2)
has the same effect as normalising s1 and s2 separately, then comparing
them with strcmp.  What you want to do is to provide any number of
normalising functions that take your fancy, and use strcmp() to
compare normalised results.  If you do it this way, then you can also
use your comparison method with an external sort:  when you write the
file to be sorted, put the normalised version first, then a mark, then
the real data.  Sort (letting the external sort use the same rule as
strcmp), then strip off the normalised prefixes.
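
As a rough sketch of how the two ANSI C functions fit together (the
helper names xfrm_dup and same_order are mine, purely for
illustration): call strxfrm() once to learn the size, once more to
fill the buffer, and the sign of strcmp() on the two keys should agree
with the sign of strcoll() on the originals.  The caller is assumed to
have selected a collating order with setlocale(LC_COLLATE, "") first;
without that call the program runs in the "C" locale, where the order
is plain strcmp() order.

```c
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd normalised copy of s, or NULL if out of memory. */
char *xfrm_dup(const char *s)
{
    size_t n = strxfrm(NULL, s, 0);   /* length needed, excluding the NUL */
    char *key = malloc(n + 1);

    if (key != NULL)
        strxfrm(key, s, n + 1);
    return key;
}

static int sign(int x) { return (x > 0) - (x < 0); }

/*
 * Returns 1 if strcmp() on the normalised copies orders s1 and s2 the
 * same way strcoll() does, 0 if not, and -1 on allocation failure.
 */
int same_order(const char *s1, const char *s2)
{
    char *k1 = xfrm_dup(s1);
    char *k2 = xfrm_dup(s2);
    int result = -1;

    if (k1 != NULL && k2 != NULL)
        result = sign(strcmp(k1, k2)) == sign(strcoll(s1, s2));
    free(k1);
    free(k2);
    return result;
}
```

Note that the keys strxfrm() produces may contain arbitrary bytes, so
for the external-sort trick a normaliser of your own that emits
printable text is the safer thing to write into the file.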

Note:  when you are sorting, you want the very fastest comparison you
can get.  Sorting a bunch of names by normalising them, then sorting
the normalised versions using strcmp(), is going to be a *LOT* faster
than sorting using your strpbcmp or anything like it.
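
In code, the idea is to build each key once and let qsort() see
nothing but strcmp().  This is only a sketch: struct entry,
sort_by_key and the stand-in normaliser fold_key (which merely folds
case) are hypothetical names, and a real phone book normaliser would
take fold_key's place.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct entry {
    char *key;          /* normalised form, computed once */
    const char *name;   /* the real data */
};

/* Stand-in normaliser: a lower-cased copy.  Substitute your own. */
static char *fold_key(const char *s)
{
    size_t i, n = strlen(s);
    char *key = malloc(n + 1);

    if (key == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        key[i] = (char)tolower((unsigned char)s[i]);
    key[n] = '\0';
    return key;
}

/* The only comparison the sort ever makes is a plain strcmp(). */
static int by_key(const void *a, const void *b)
{
    const struct entry *x = a;
    const struct entry *y = b;

    return strcmp(x->key, y->key);
}

/* Sort names[0..n-1] in place.  Returns 0, or -1 if out of memory. */
int sort_by_key(const char **names, size_t n)
{
    struct entry *tab;
    size_t i, built;

    if (n == 0)
        return 0;
    tab = malloc(n * sizeof *tab);
    if (tab == NULL)
        return -1;

    for (built = 0; built < n; built++) {
        tab[built].name = names[built];
        tab[built].key = fold_key(names[built]);
        if (tab[built].key == NULL)
            break;
    }
    if (built == n) {
        qsort(tab, n, sizeof *tab, by_key);
        for (i = 0; i < n; i++)
            names[i] = tab[i].name;
    }
    for (i = 0; i < built; i++)
        free(tab[i].key);
    free(tab);
    return built == n ? 0 : -1;
}
```

The normaliser runs n times here, whereas a comparison function that
normalises on the fly would run it on both arguments at every one of
the roughly n log n comparisons.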
-- 
I agree with Jim Giles about many of the deficiencies of present UNIX.