Need help ** removing duplicate rows **

Dan Bernstein brnstnd at kramden.acf.nyu.edu
Wed Oct 31 16:18:32 AEST 1990


In article <10182 at jpl-devvax.JPL.NASA.GOV> lwall at jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
> In article <1990Oct31.003627.641 at iwarp.intel.com> merlyn at iwarp.intel.com (Randal Schwartz) writes:
> : In article <1990Oct30.234654.23547 at agate.berkeley.edu>, c60b-3ac at web (Eric Thompson) writes:
      [ if multiple (consecutive?) rows of colon-separated columns ]
      [ match in every column except the second, scrap the extras ]
> : perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
> : Fast enough?
  [ as happens with every Perl program posted to the net, Larry points ]
  [ out how inefficient this can be: ]
> Maybe, but he said they were very long files, and that may mean more than
> you'd want to store in an associative array, even with virtual memory.
> Presuming the files are sorted reasonably, you can get away with this:
> perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'

That does look like what Eric was asking for, but what if the file is
not sorted? Is there a fast Perl solution?
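(If you really must handle the unsorted case in Perl without holding the
whole array in core, one idea, and this is only a sketch with a made-up
/tmp/seen.$$ dbm name, is to tie the seen-array to a dbm file:

  perl -e 'dbmopen(%seen, "/tmp/seen.$$", 0600) || die "dbmopen: $!";
           while (<>) {
             ($key = $_) =~ s/:[^:]*//;    # strip the second field
             print unless $seen{$key}++;   # print first occurrence only
           }'

That keeps the memory down, but you pay a dbm lookup per line and still
have to clean up the dbm files afterwards; sort does the same disk-based
bookkeeping for you.)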

> Of course, someone will post a solution using cut and uniq, which will be
> fine if you don't mind losing the second field.  Or swapping the first
> two fields around. 

cut? uniq? Why? There's already a tool perfectly matched to the job:

  sort -u -t: +0 -1 +2

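Here -t: makes colon the field separator, +0 -1 keys on the first field
alone, and +2 keys on everything from the third field to the end of the
line; -u then emits one line per distinct key. On a made-up three-line
file like

  joe:x100:sales
  joe:x200:sales
  ann:x200:admin

it prints something like

  ann:x200:admin
  joe:x100:sales

since the two joe lines agree everywhere but the second field; which of
the two survives is up to the implementation.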
sort already knows how to work in limited memory. If the input is
already sorted,

  sort -m -u -t: +0 -1 +2

should do the trick. Both of these solutions are easy to figure out, easy
to type, very fast even on long files, and quite portable.
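(The -m flag tells sort its input is already in key order, so it makes a
single merging pass with no temporary files. It also means that if the
data is split across several presorted files, say part1 and part2, names
made up,

  sort -m -u -t: +0 -1 +2 part1 part2

merges and uniques them in one go.)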

> I'll leave the awk and sed solutions to someone else.

Yes, I seem to always be defending the classic tools against this
onslaught of Perl code that nobody but you can ever optimize.

---Dan
