Unnecessary tar-compress-uuencodes

Lars Henrik Mathiesen thorinn at skinfaxe.diku.dk
Wed Jul 11 04:25:46 AEST 1990


tneff at bfmny0.BFM.COM (Tom Neff) writes:
>[Many good reasons not to tar-compress-uuencode source and other
>plain text in news postings.]
>
> * Compressed newsfeeds, which already impart whatever transmission
>   efficiency gain LZW can offer, are circumvented and in fact
>   sandbagged by the pre-compression of data.

That turns out not to be the case. It is true that a compressed file
will usually expand if it is compressed again. But the intervening
uuencode is very important: Compressing a uuencoded file is somewhat
independent of compressing the original (*). I ran an experiment on a
tar of a directory tree with mixed source, binaries, and images.

	name	       size		crummy ASCII graphics
	----------  -------		---------------------
	tar	    4718592	tar	 ------- -60.3% ------>	tar.Z
				   |                                |   
	tar.Z	    1874378	+37.8%				 +37.8%
				   |				    |
	tar.uu	    6501192	   V				    V
				tar.uu	 ------- -60.3% ------>	tar.Z.uu
	tar.Z.uu    2582500	   |				    |
				-63.2%				 -13.7%   
	tar.uu.Z    2392701	   |				    |     
				   V				    V
	tar.Z.uu.Z  2229065	tar.uu.Z -------  -6.8% ------>	tar.Z.uu.Z

Of course, compression factors will vary widely; I have made this
experiment several times, with the same picture emerging: It pays to
compress before uuencoding, and it pays to compress after, and it pays
best to do both.
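The experiment is easy to reproduce in miniature. Here is a small
Python sketch of it, with zlib standing in for compress and base64 for
uuencode (an assumption on my part: LZW compress and the uuencode
alphabet are not in the standard library, but both stand-ins have the
same 3-bytes-to-4-characters, 64-symbol shape, so the picture is the
same even if the percentages differ):

```python
# Reproduce the tar / tar.Z / tar.uu size diagram with stand-ins:
# zlib (deflate) for compress, base64 for uuencode.
import base64
import zlib

data = str(list(range(20000))).encode()   # compressible stand-in for a tar

z      = zlib.compress(data)              # "tar.Z"
uu     = base64.b64encode(data)           # "tar.uu"
z_uu   = base64.b64encode(z)              # "tar.Z.uu"
uu_z   = zlib.compress(uu)                # "tar.uu.Z"
z_uu_z = zlib.compress(z_uu)              # "tar.Z.uu.Z"

for name, blob in [("tar", data), ("tar.Z", z), ("tar.uu", uu),
                   ("tar.Z.uu", z_uu), ("tar.uu.Z", uu_z),
                   ("tar.Z.uu.Z", z_uu_z)]:
    print(f"{name:12s} {len(blob):8d}")

# The same picture as the diagram above:
assert len(z_uu) > len(z)        # encoding expands the compressed file
assert len(z_uu_z) < len(z_uu)   # compressing after encoding still helps
assert len(z_uu_z) < len(uu_z)   # compressing first pays best
```

The exact sizes depend on the input, but the ordering of the three
assertions has held on every input I would expect on a news feed.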

In words: If you have to post uuencoded stuff (tar archives, images,
whatever), COMPRESS it first. It is always better: In terms of
storage on intermediate nodes and of transmission on non-compressed
links it is very much better; it may not save much on compressed
links, but it doesn't hurt (contrary to common assertions), and the
small saving may still pay for the cost of running compress (which
has less data to process anyway, so it doesn't run as long).

I wish this misconception about the badness of compressed uuencoded
data on compressed news links would go away; anyone for a news.config
FAQ posting?
______________________________________________________________________
(*) An attempt at an explanation: The uuencode process maps the source
bytes into a smaller set (64 symbols), mapping every three source
bytes into four encoded characters and inserting newlines. Compress
works by finding common byte sequences and mapping them into symbols.
A common source sequence can land in three different ``phases'' after
uuencode, and may be broken by newlines, so compress will not find it
as easily. Of course, long runs of identical bytes, as are common in
images, are immune to the shift effect.
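The phase effect is easy to see directly. Python's base64 serves as a
stand-in here (an assumption: it is not the uuencode alphabet, but
like uuencode it maps each three bytes to four characters from a
64-symbol set, so the alignment behaviour is the same):

```python
# Show that the same byte sequence encodes to different characters
# depending on its offset modulo 3 -- the "phase" effect.
import base64

chunk = b"abc"                                 # a repeated 3-byte sequence

aligned = base64.b64encode(chunk * 6)          # chunk starts at offset 0 mod 3
shifted = base64.b64encode(b"X" + chunk * 6)   # same chunks, offset 1 mod 3

print(aligned)   # every 'abc' encodes to the same four characters, 'YWJj'
print(shifted)   # the identical bytes now encode to different characters

assert b"YWJj" in aligned        # the repeat is visible to a compressor...
assert b"YWJj" not in shifted    # ...but vanishes after a one-byte shift
```

A dictionary compressor scanning the shifted stream cannot reuse the
codes it learned from the aligned one, which is why uuencode hides
part of the redundancy from a later compress.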

On the other hand, a 16-bit compress should be able to map all the
2-symbol uuencode sequences and about one fourth of the 3-symbol ones
into a 16-bit symbol, giving a compression of about 12% on the
uuencode of a totally random byte sequence. (Running compress after
compress-uuencode usually gives between 11% and 14% compression,
bearing this out; for this purpose, the first compress effectively
gives a random sequence.)
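That claim can be checked with the same stand-ins (base64 for
uuencode, zlib's deflate rather than a 16-bit LZW compress, so the
exact percentage comes out differently, but the point survives: the
encoded form of even incompressible data shrinks, because the encoder
emits only 64 distinct symbols):

```python
# Compress the 64-symbol encoding of random (incompressible) bytes.
import base64
import random
import zlib

random.seed(42)
raw = random.randbytes(30000)        # effectively incompressible input
enc = base64.b64encode(raw)          # 4/3 expansion, 64-symbol alphabet

comp = zlib.compress(enc, 9)
saving = 1 - len(comp) / len(enc)
print(f"{len(enc)} -> {len(comp)} ({saving:.1%} saved)")

assert len(zlib.compress(raw, 9)) >= len(raw) * 0.99  # raw gains nothing
assert saving > 0.10                                  # encoded form shrinks
```

Deflate's Huffman stage alone recovers most of the 6-bits-in-8 slack,
so the saving here is larger than the 11--14% a 16-bit LZW gets, but
the direction is the same.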

So: compress may get more of the ``available compression'' in a given
input if it is run before uuencode. On the other hand, compress will
be able to undo some of the expansion caused by uuencode, masking the
first effect.

--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcsun!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.      thorinn at diku.dk


