Is 4.2BSD a failure?

John Bass bass at dmsd.UUCP
Sat Jan 26 21:11:19 AEST 1985


Is 4.x a failure ... NO
Is 4.x a high-performance system ... MAYBE YES and MAYBE NO

People forget that 4.1 and 4.2 were paid for and tuned for AI vision and CAD/CAM
projects sponsored by ARPA and various companies. For the job mix that 4.x
systems are tuned for, they are the ONLY way to do those jobs on a UNIX machine
with any cost effectiveness.

Many tradeoffs that go into 4.x systems run directly counter to the best ways
to tune UNIX systems for development environments ... but they were the
only way to make things work for the target applications.

The converse is also true to a large extent ... V7/SIII/S5 kernels don't
handle large applications well or at all --- try running a 6mb application
on an older Bell system with swapping .... it takes many seconds for a
single swap in/out.

But exactly the same problem arises out of blindly using a paging system for
the typical mail/news/editor/compiler type support or development machine.
In this environment the typical job size is 5-50 pages with a 75-95% working
set ... when the total working set for the virtual address space approaches
the real memory size on a 4.x system running these small jobs, the system
goes unstable, causing any program doing disk I/O to lose one or more critical
pages from its working set while it waits for its disk I/O (either requested
via a system call ... or just from page faulting).

The result is a step degradation in system throughput and an interesting
non-linear load curve with LOTS of hysteresis AND a sharp load limit at
about 150-250% of real memory. In comparison a swap-based system degrades
smoothly at 1/#users for most systems ... up to a limit. On
swap-based systems, if the swap in + swap out time exceeds the scheduling
quantum (several seconds on most UNIX systems), then even a swap-based
system can thrash and show a similar step degradation, non-linear load curve,
hysteresis, and load limit. This was evident on ONYX because the
burst disk throughput was limited by the z80 controller to a 3:1 interleave
or about 180kb/sec .... memory was relatively cheap compared to fast disks
in 1980 so we sold lots of memory.
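
To put rough numbers on that (a quick C sketch; the 180kb/sec figure is the
ONYX channel rate above, the 3-second quantum and the image sizes are my
assumptions, and seek/rotational latency is ignored):

    /* Rough estimate of swap out + swap in time versus the scheduling
     * quantum.  The disk rate comes from the ONYX figure above; the
     * quantum and image sizes are assumptions for illustration.
     */
    #include <stdio.h>

    int main(void)
    {
        double disk_kb_per_sec = 180.0;  /* z80 controller, 3:1 interleave */
        double quantum_sec     = 3.0;    /* "several seconds" quantum (assumed) */
        int    sizes_kb[]      = { 64, 256, 1024, 6144 };
        int    i;

        for (i = 0; i < 4; i++) {
            double swap_sec = 2.0 * sizes_kb[i] / disk_kb_per_sec;  /* out + in */
            printf("%5d kb image: swap out+in ~ %6.1f sec (%s quantum)\n",
                   sizes_kb[i], swap_sec,
                   swap_sec > quantum_sec ? "EXCEEDS" : "within");
        }
        return 0;
    }

Once the out+in time passes the quantum the swapper never catches up, and you
get the same step degradation described above.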

This was evident on the Fortune VAX running 4.1 after several months of
intensive load analysis tracking load factor results and instrumenting
the disk subsystem. 4.1's favorite trick is to have a step increase in load
factors from between 1-4 to 10-20 with little time anywhere in between.
On the Fortune VAX this was caused by an interaction between paging and
filesystem traffic on the root spindle when the average service time
in the disk queue exceeded the memory reaper's quantum. A careful policy
of relaying out the filesystems and regularly dump/restoring them to
keep them sequential and optimally packed kept the filesystem (read: disk
subsystem) throughput high enough that the step degradation (step increase in
load factors) would not occur; after that we seldom saw load factors of 10-20,
and then only with a linear rise in load. I have seen the same problem on
most other 4.1 systems ... particularly those with a single spindle and
small memory configurations (less than 2mb).

Most VAX systems run 35-50 transactions per second average to the entire
disk subsystem ... a swap system handling a 40k process will typically take
one or two transactions ... a paging system 40 or more depending on the thrashing
level. The working set theory CORRECTLY predicts such poor behavior for
such small programs with large percentages of active pages. If it is required
to run several very large images (CAD/CAM, vision or other high-res graphical
applications) with 2-8 mbyte arrays ... then the working set theory combined
with processor speed/memory size predictors makes paging a clear choice.
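
Here is the same comparison as a trivial C sketch (the 40k image and the
35-50 transactions/sec budget are the figures above; the 1k page size is an
assumption):

    /* Disk transactions needed to bring in a 40k image: one whole-image
     * swap versus page-at-a-time faulting under thrashing, measured
     * against the whole system's transaction budget for one second.
     */
    #include <stdio.h>

    int main(void)
    {
        double image_kb   = 40.0;
        double page_kb    = 1.0;     /* assumed page/cluster size */
        double disk_tps   = 40.0;    /* typical VAX disk subsystem budget */

        double swap_xacts = 2.0;                 /* one or two big transfers */
        double page_xacts = image_kb / page_kb;  /* 40+ faults when thrashing */

        printf("swap:   %3.0f transactions = %5.1f%% of one second's budget\n",
               swap_xacts, 100.0 * swap_xacts / disk_tps);
        printf("paging: %3.0f transactions = %5.1f%% of one second's budget\n",
               page_xacts, 100.0 * page_xacts / disk_tps);
        return 0;
    }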

Much of the speed difference of 4.1 over V7 and SIII/S5 was simply the
1k filesystem. For older PDP11's the per-block processing time for most
512-byte sectors was several times the transaction period .... i.e. ...
it took 6-10 milliseconds of cpu time to digest a block, which
was traded off against filesystem throughput and the memory constraint of a
256kb max system size. The advent of much faster processors and much larger
system memory made using 1k blocks necessary and practical where your
system didn't have the cycles or space before. For those of us
mothering 11/45's in the 70's this was a very difficult tradeoff ...
we had kernels of about 70kb leaving less than 180kb to support 2-6 in-core
processes/users ... or in today's terms ... no more than 2-3 happy vi users.
Increasing the filesystem block size to 1k would increase memory overhead by
6-10k in the kernel and 2-4k in each process ... or ONE LESS VI DATA/STACK
segment --- a major (30-50%) reduction in the number of in-core jobs and much
more swapping and response time delay.
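
The arithmetic is easy to write down (a sketch; the 60kb-per-vi-user figure
and the midpoint costs are my assumptions, the rest comes from the numbers
above):

    /* Back-of-the-envelope on the 11/45 block-size tradeoff: 256kb
     * machine, ~70kb kernel, an assumed ~60kb of data+stack per
     * resident vi user, and the 1k-block cost charged at roughly the
     * midpoint of the ranges quoted above (8kb kernel, 3kb per process).
     */
    #include <stdio.h>

    int main(void)
    {
        int total_kb     = 256;
        int kernel_kb    = 70;
        int per_user_kb  = 60;   /* assumed in-core footprint of a vi user */
        int kern_cost_kb = 8;    /* extra kernel buffer space for 1k blocks */
        int proc_cost_kb = 3;    /* extra per-process buffering */

        int user_kb_512  = total_kb - kernel_kb;
        int users_512    = user_kb_512 / per_user_kb;

        int user_kb_1k   = total_kb - kernel_kb - kern_cost_kb;
        int users_1k     = user_kb_1k / (per_user_kb + proc_cost_kb);

        printf("512-byte blocks: %d kb for users -> %d resident vi users\n",
               user_kb_512, users_512);
        printf("1k blocks:       %d kb for users -> %d resident vi users\n",
               user_kb_1k, users_1k);
        return 0;
    }

Dropping from 3 resident users to 2 is right in that 30-50% range.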

Today with relatively cheap RAM ... only the smallest systems need to worry
about this problem ... and then a mix of swapping (for jobs below some threshold
in the 150-500kb range) and paging (for jobs above it) will make most of these
problems go away.
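
The policy itself is trivial to state; here is a runnable sketch of the
decision (the 300kb cutoff is just one point in the 150-500kb range, and the
two "mechanisms" are stubs I made up for illustration, not real kernel code):

    /* Mixed policy sketch: swap whole images below a size threshold,
     * page on demand above it.  Threshold and stubs are illustrative.
     */
    #include <stdio.h>

    #define SWAP_LIMIT_KB 300

    static void swap_whole_image(int kb)
    {
        printf("%4d kb: swap whole image (1-2 large contiguous transfers)\n", kb);
    }

    static void page_on_demand(int kb)
    {
        printf("%4d kb: page on demand, hold the working set in core\n", kb);
    }

    static void bring_in(int image_kb)
    {
        if (image_kb <= SWAP_LIMIT_KB)
            swap_whole_image(image_kb);
        else
            page_on_demand(image_kb);
    }

    int main(void)
    {
        int sizes_kb[] = { 40, 150, 500, 4096 };  /* small jobs and a CAD image */
        int i;

        for (i = 0; i < 4; i++)
            bring_in(sizes_kb[i]);
        return 0;
    }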

As for the 4.2 "fast filesystem" ... it was again tuned to make large file
transactions run at an acceptable rate .... try to load/process a 4mb vision
or CAD/CAM file at 30-50 1k block transactions per second -- it will run
SLOW compared to a production system with contiguous files. A number of
tradeoffs were made to help large file I/O and improve the transaction
rates on very loaded systems (like ucb ernie ... the slowest UNIX
system I have ever used .... even my 11/23 running on floppies was
faster). But for most of us -- particularly us small machine types ..
PDP11/23's, 11/73's, ONYX's, Fortune's, Tandy 16B's ... and a number of
other commercial systems (including VAX 11/730's, 11/750's, and MicroVAXes)
which run 1-8 users ... the 4.2 filesystem is VERY SLOW and degrades
much faster over time than a v7/4.1 filesystem.
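
The 4mb case is worth doing the arithmetic on (a sketch; the transaction rates
are the ones quoted above, the contiguous rate is my assumption for a
production system with preallocated contiguous files):

    /* Time to read a 4mb vision/CAD file block-by-block at 30-50
     * random 1k transactions per second, versus a contiguous layout
     * limited only by an assumed 500kb/sec transfer rate.
     */
    #include <stdio.h>

    int main(void)
    {
        double file_kb     = 4096.0;
        double tps_low     = 30.0, tps_high = 50.0; /* random 1k transactions/sec */
        double contig_kbps = 500.0;                 /* assumed contiguous rate */

        printf("random 1k blocks:  %5.0f - %5.0f sec\n",
               file_kb / tps_high, file_kb / tps_low);
        printf("contiguous layout: %5.0f sec\n", file_kb / contig_kbps);
        return 0;
    }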

The tradeoff here is that "locality of reference" is much smaller and better
defined on smaller systems ... on larger systems (like ernie) the disk queue
has a large number of requests spread across the entire disk with a much
broader locality of reference. The 4.2 filesystem attempts to remove
certain bimodal or n-modal access patterns based on the FACT that, for systems
with large disk request queues, it doesn't much matter where the data is for
reading ... but it is better to write it without generating a seek.
This doesn't hold up on small systems where much of the
time there is a single active reader using the disk subsystem.
On a small system locality of reference is the entire key to throughput,
so scattering files wherever allocation happens to land is a great loss.
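
A toy model shows why queue depth matters so much (my assumptions: uniform
random block positions, an elevator sweep over whatever is queued, and seek
cost measured as a fraction of a full-stroke seek -- an illustration, not a
disk model):

    /* With N requests queued, one elevator sweep across the disk is
     * amortized over all N of them; with a single active reader every
     * request pays a full random seek.  Monte Carlo estimate of the
     * per-request seek cost as queue depth grows.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static double frand(void) { return (double)rand() / RAND_MAX; }

    static int cmp(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        int depths[] = { 1, 4, 16, 64 };
        int trials = 10000;
        int d, t, i;

        for (d = 0; d < 4; d++) {
            int n = depths[d];
            double total = 0.0;
            for (t = 0; t < trials; t++) {
                double pos[64], head = frand(), sweep;
                for (i = 0; i < n; i++)
                    pos[i] = frand();
                qsort(pos, n, sizeof(double), cmp);
                /* seek to the lowest queued block, then sweep upward */
                sweep = (head > pos[0] ? head - pos[0] : pos[0] - head)
                      + (pos[n - 1] - pos[0]);
                total += sweep / n;   /* seek cost charged per request */
            }
            printf("queue depth %2d: avg seek per request ~ %.2f of full stroke\n",
                   n, total / trials);
        }
        return 0;
    }

With one request at a time the per-request seek cost stays near a third of the
full stroke; with a deep queue it nearly vanishes, which is the environment the
4.2 layout was tuned for.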


I have spent most of my 10 years of UNIX systems hacking, porting and tuning
on smaller systems. Other than the CAD markets I don't see much use for paging
systems, and as a result view 4.1/4.2 as only a hindrance due to the tendency
of some firms to put all the bells and toys into their system. This has
been a disaster for several firms who got sidetracked by Berkeley
grads and hangers-on.

But in the big system markets ... particularly CAD/CAM, high-res graphics,
large multiuser systems (30-200 users), and AI/Lisp markets, 4.2 may be the
only alternative ... still, it would be a mistake to drag standard UNIX
blindly down the 4.2 path ... 99.99% of the UNIX systems either delivered
today or built in the next couple of years would be hurt badly by it.
It would make MSDOS the number one alternative to UNIX on smaller systems
-- not such a bad system ... but let's keep it in its place too.


I have a lot of interesting numbers and recommendations in performance areas ...
I was going to give them in a talk at Dallas, but they saw fit to cancel
it after requiring, without any notice, a formal paper for the unplanned
proceedings ... and then having a two-page draft lost in the mail.
I don't feel too bad about it since apparently 8 other speakers were
also accepted and dropped because they couldn't get papers written and
approved in the several-day to two-week window. I hope that the next time they
put out a call for presentations USENIX lets people know in advance that
papers are required, and doesn't change the rules at the 11th hour if they
say they are not. Most of us can't write a GOOD 5-10 page paper on a
24-hour deadline, which is basically what they asked of speakers this time --
other than those who had already done a paper for some reason.

Good nite ... have fun

John Bass
Systems Consultant
(408)996-0557


