Sort bug causes data loss

Griff Smith ggs at ulysses.att.com
Wed Sep 19 06:20:56 AEST 1990



In article <2675 at crdos1.crd.ge.COM>, davidsen at crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
> 
>   I have discovered what appears to be a serious bug in the sort
> routine used in several SysV variants including Stellar. Since it
> causes silent loss of data I am cross posting a bit more than I usually
> do.
> 
[deleted some details to save space, followed by test script...]
> 
> sort -nu <<XX >x$$.tmp
>   1: a
>   3: b
>   2: c
>   1: a
>  10: x
> XX
> 
>   Of course someone may tell me it's supposed to work that way, and that
> the BSD version is broken.

I suspect this may be the case.  The System V manual page says this about
the -u option:

     -u	  Unique: suppress all but one in each set of lines hav-
	  ing equal keys.

This doesn't agree with the code, though.  The real behavior matches what
I find in the BSD manual page:

     u	  Suppress all but  one	 in  each  set	of  equal  lines.
	  Ignored bytes	and bytes outside keys do not participate
	  in this comparison.

The next clue is from the System V manual page again:

     -n	  An initial numeric string, consisting	of optional
	  blanks, optional minus sign, and zero	or more	digits
	  with optional	decimal	point, is sorted by arithmetic
	  value.  The -n option	implies	the -b option (see
	  below).  Note	that the -b option is only effective when
	  restricted sort key specifications are in effect.

The tricky point is that a numeric comparison stops as soon as it finds
a non-numeric character, and the blank skipping that -n implies (via -b)
only takes effect when a sort key is specified.  Since your test file has
leading blanks, and you didn't specify a sort key, the numeric comparison
stops when it sees the leading blank in each record; as far as the numeric
comparison code is concerned, the test file contains five empty records.
Furthermore, the -u option suppresses the following escape clause in
the manual page:

     When there	are multiple sort keys,	later keys  are	 compared
     only  after all earlier keys compare equal.  Lines	that oth-
     erwise compare equal are ordered with all bytes significant.

Translation: if a numeric comparison, or a set of keyed comparisons,
shows that two records match, `sort' then compares both records as
simple text to determine whether the records are really identical.  
This `tie breaking' test is suppressed if the -u option is enabled.
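
To make the tie-breaking rule concrete, here is a toy model in Python.
It is only a sketch of the behavior described above, not the System V
sort source; the names numeric_key and model_sort are made up for
illustration, and the choice to keep the first line of each set under -u
is an assumption (the manual page doesn't say which line survives).

```python
import re
from itertools import groupby

# The five records from the test script, leading blanks included.
LINES = ["  1: a", "  3: b", "  2: c", "  1: a", " 10: x"]


def numeric_key(text):
    """Value of the numeric string at the very start of text, else 0.

    No blank skipping is done, so a leading blank ends the number at once.
    """
    prefix = re.match(r"-?\d*\.?\d*", text).group(0)
    try:
        return float(prefix)
    except ValueError:
        return 0.0  # "", "-", "." and "-." contain no digits


def model_sort(lines, unique):
    """Model of `sort -n` (unique=False) and `sort -nu` (unique=True)."""
    if unique:
        # -u: lines with equal keys form one set; no tie-breaking test,
        # so only one line per set is kept (assumed here: the first).
        ordered = sorted(lines, key=numeric_key)
        return [next(group) for _, group in groupby(ordered, key=numeric_key)]
    # Without -u: lines whose keys match are ordered with all bytes significant.
    return sorted(lines, key=lambda line: (numeric_key(line), line))


print(model_sort(LINES, unique=False))  # all five records survive
print(model_sort(LINES, unique=True))   # a single record survives
```

In this model every record gets the key 0, so without -u the whole-line
comparison still keeps all five, while with -u they collapse into one set.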

Since all five of your test lines appear to be identical, the -u
option deletes all but one of them.  I think the command you want
to use is

	sort -nu +0

This forces a trip through the key finder, which activates the code
that strips leading blanks.
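
The effect of stripping leading blanks can be sketched in Python as well.
Again this is a hypothetical model rather than the real code: the
skip_blanks flag stands in for "went through the key finder", and the
function names are invented for the example.

```python
import re

# The five records from the test script, leading blanks included.
LINES = ["  1: a", "  3: b", "  2: c", "  1: a", " 10: x"]


def numeric_key(text, skip_blanks):
    """Parse the initial numeric string, optionally skipping leading blanks."""
    if skip_blanks:
        text = text.lstrip(" \t")
    prefix = re.match(r"-?\d*\.?\d*", text).group(0)
    try:
        return float(prefix)
    except ValueError:
        return 0.0  # no digits found: treated as an empty (zero) key


def unique_numeric_sort(lines, skip_blanks):
    """Keep one line per distinct numeric key, in ascending key order."""
    kept = {}
    for line in lines:
        # Assumption: the first line seen for a key is the one retained.
        kept.setdefault(numeric_key(line, skip_blanks), line)
    return [kept[key] for key in sorted(kept)]


print(unique_numeric_sort(LINES, skip_blanks=False))  # one record left
print(unique_numeric_sort(LINES, skip_blanks=True))   # four records left
```

With blank skipping the keys become 1, 3, 2, 1 and 10, so only the
duplicated "1: a" record is dropped.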

> -- 
> bill davidsen	(davidsen at crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
>     VMS is a text-only adventure game. If you win you can use unix.

Flames and counterarguments cheerfully accepted.  I didn't write the
rules, I just work here.
-- 
Griff Smith	AT&T (Bell Laboratories), Murray Hill
Phone:		1-201-582-7736
UUCP:		{most AT&T sites}!ulysses!ggs
Internet:	ggs at ulysses.att.com


