Compressed executables

Tue Jan 22 23:04:04 AEST 1991

In article <3977 at skye.ed.ac.uk> richard at aiai.UUCP (Richard Tobin) writes:
>In article <118868 at uunet.UU.NET> rbj at uunet.UU.NET (Root Boy Jim) writes:
>>I once suggested to Chris Torek that the kernel should execute
>>compressed programs. He groaned.
>
>This has been done by Acorn in their RISC iX port of 4.3.  The compression
>is done in such a way that the file can still be demand paged.  
>
>Perhaps someone from Acorn could give more details?

Yes.  Here they are:

The machines on which RISC iX is currently implemented have
"interesting" memory management hardware.  Each bank of 4Mbytes of
physical memory is handled by a MEMC (Memory Controller) chip in such
a way that every physical page is always mapped to some logical page
(yes, it *is* that way round).  This is done entirely on-chip, using a
CAM in which each entry controls one physical page: if the entry's
contents (logical page number) matches the logical page address put
out by the ARM CPU, then the associated physical page (identified by
the index of the entry in the CAM table) is used for the data
transfer, provided that the associated protection bits allow.  In the
technology used in 1986 when MEMC was first made the practical limit
on CAM entries was 128.  Hence the page size for a 4Mb physical memory
chunk comes out at a whopping 32Kb.  If you are prepared to use more
MEMCs and have each control less memory, the page size goes down (to a
minimum of 4Kb, for 512Kb/MEMC); of course this adds to system cost
for several reasons so Acorn chose not to.

Now the consequences of a 32Kb page size are in many respects
unpleasant, but it is possible to turn some of these to advantage.  In
particular since the first RISC iX machine (the Acorn R140, launched
in early 1989 for approx. 3.5k pounds) had only 50Mb of disc and we
wanted to put as much as possible of 4.3 BSD onto it, we needed to do
some shoehorning.  The problem is exacerbated because ARM, in common
with most RISC architectures, has fairly poor code density (though not
as bad as some).  The solution was to use a code compression technique
which is tuned to the particular typical bit patterns of ARM
instruction sequences and which achieves a reduction of around 50%.
The saving in disc space comes by applying this compression technique
on a per-page basis to demand paged executable files.  Each page as
stored on disc would normally occupy 32Kb or four disc blocks on an
8Kb blocksize filesystem, but compression reduces this to around 16Kb
on average, taking up two or perhaps three 8Kb blocks.  (With the
newer R260 machine we moved to 4Kb blocks in the shipped filesystem,
which further improves the average disc space saving.)  The unused
blocks in a page prototype simply become "holes".  In conjunction with
a simple shared library scheme, the actual disc occupancy of
executables becomes quite tolerable.

The compression operation - we call it "squeezing" - is applied to a
normal demand-load format program as a separate operation (the linker
produces normal uncompressed output, but may immediately invoke the
squeeze program if desired).  An unsqueeze program performs the
inverse operation if required (e.g. for debugging purposes).

Compressed executables comprise a header, the text, the data and an
optional symbol table.  The header includes a magic number to identify
the compressed format and two tables of values for the decompression
algorithm, and is zero padded to the next page boundary.  Each page of
text or data is compressed "in place", i.e. the compressed data starts
at the same offset in the file that the uncompressed data did, and the
symbol table if present also starts at a page-size offset.  

The kernel recognises compressed executables by the magic number at
exec time, and reads in the decompression tables in the header at this
point, via the buffer cache.  The tables are typically not very big,
and are held on disc in a compacted format which is expanded by the
kernel.  To save memory, since most programs use much less than 32Kb
of stack, the expanded tables are kept in user space just before the
actual process's user stack.  To demand load a page, the kernel reads
in the blocks containing the compressed code or data directly into the
new page frame (not through the buffer cache) then applies the
decompression algorithm on the data.  Since the data starts at the
same file address regardless of compression, there is no problem
finding it.  A checksum kept with the compressed data helps protect
against problems such as the possibility, practically never seen, of a
wayward program corrupting the decompression tables in its own stack
area.  The page read operation takes advantage of any adjacency of
disc blocks (which with careful tuning of the filesystem as shipped
happens most of the time) and of course holes take no time to load!
The decompression algorithm is extremely quick, and the net result is
that a page can be ready for use in less wall-clock time than it would
have taken to read in a whole 32Kb of uncompressed data directly.  So
the technique saves both disc space and time, although of course if
the page size were smaller these benefits would probably be outweighed
by other considerations.

------------------------------------------------------------------------
Mark Taunton 					Email:	mark at acorn.co.uk
RISC iX Development Group			
Acorn Computers Ltd
Cambridge, England.