Need help ** removing duplicate rows **

Larry Wall lwall at jpl-devvax.JPL.NASA.GOV
Wed Oct 31 12:26:06 AEST 1990


In article <1990Oct31.003627.641 at iwarp.intel.com> merlyn at iwarp.intel.com (Randal Schwartz) writes:
: In article <1990Oct30.234654.23547 at agate.berkeley.edu>, c60b-3ac at web (Eric Thompson) writes:
: | I have a few very long files that contain rows of ASCII data.  Each row
: | looks something like this (not the actual data here):
: | 
: | a:A:b:c:d:e:f:g:h:i:j:k:l:m
: | a:B:b:c:d:e:f:g:h:i:j:k:l:m
: | a:C:b:c:d:e:f:g:h:i:j:k:l:m
: | a:D:b:c:d:e:f:g:h:i:j:k:l:m
: | b:A:n:o:p:q:s:t:u:v:w:x:y:z
: | c:A:x:a:x:b:x:c:d:a:m:l:v:x
: | d:A:m:l:k:j:i:h:g:f:e:d:c:b
: | d:B:m:l:k:j:i:h:g:f:e:d:c:b
: | d:C:m:l:k:j:i:h:g:f:e:d:c:b
: | 
: | It's the second column that's important.  If there are multiple rows that
: | are exactly the same except for the second column, I want to GET RID of them.
: | If the row is unique (for example, the ones starting with "b" and "c" above)
: | then it should stay.  Sounds like what I need is a way to filter out rows
: | that are duplicate except in the second column.
: 
: A one-liner in Perl:
: 
: perl -ne '($a,$b,$c) = split(":",$_,3); print unless $seen{$a,$c}++;'
: 
: Fast enough?

Maybe, but he said they were very long files, and that may mean more keys
than you'd want to store in an associative array, even with virtual memory.
Presuming the files are sorted so that rows differing only in the second
field are adjacent, you can get away with this:

perl -ne '($this = $_) =~ s/:[^:]*//; print if $this ne $that; $that = $this'

Of course, someone will post a solution using cut and uniq, which will be
fine if you don't mind losing the second field or swapping the first two
fields around.  I'll leave the awk and sed solutions to someone else.
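
For illustration, the cut and uniq approach might look like the sketch
below.  None of these pipelines appear in the thread: the variable names are
invented, and the second pipeline is only one plausible reading of the
"swapping" remark, using sed to move the second field to the front so that
`uniq -f 1` can skip it.  It assumes the data contains no blanks.

```shell
# A shortened copy of the sample rows; "rows" is a name chosen for this sketch.
rows='a:A:b:c:d:e:f:g:h:i:j:k:l:m
a:B:b:c:d:e:f:g:h:i:j:k:l:m
b:A:n:o:p:q:s:t:u:v:w:x:y:z
d:A:m:l:k:j:i:h:g:f:e:d:c:b
d:B:m:l:k:j:i:h:g:f:e:d:c:b'

# Variant 1: drop field 2 entirely, then collapse adjacent identical lines.
# The second field is lost from the output.
out_cut=$(printf '%s\n' "$rows" | cut -d: -f1,3- | uniq)

# Variant 2 (an assumption about what "swapping" could mean): move field 2
# to the front as a blank-separated field, let uniq skip it while comparing,
# then move it back to its original position.
out_swap=$(printf '%s\n' "$rows" |
  sed 's/^\([^:]*\):\([^:]*\):/\2 \1:/' |
  uniq -f 1 |
  sed 's/^\([^ ]*\) \([^:]*\):/\2:\1:/')

printf '%s\n' "$out_cut"
printf '%s\n' "$out_swap"
```

Like the comparison against the previous line, both variants only catch
duplicates that are adjacent in the input.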

Larry



More information about the Comp.unix.questions mailing list