Why UNIX I/O is so slow (was VAX vs SUN 4 performance)

Alan Klietz alan at mn-at1.k.mn.org
Sat Jun 18 05:16:32 AEST 1988


In article <6963 at cit-vax.Caltech.Edu> mangler at cit-vax.Caltech.Edu (Don Speck) writes:
<In article <23326 at bu-cs.BU.EDU>, bzs at bu-cs.BU.EDU (Barry Shein) writes:

	[why UNIX I/O is so slow compared to big mainframe OS]

A useful model is to partition the time spent on every I/O request
into a fixed portion and a portion proportional to the transfer
length.  tf is the fixed overhead to reset the interface hardware,
queue the I/O request, wait for the data to rotate under the head
(for networks, the time to process all of the headers), etc.  td is
the marginal cost of transferring one unit of data (byte, block,
whatever).  The total I/O utilization of a channel is then
characterized by

	        n td
	D = ------------
	     tf + n td

	for n units of data.  As n -> inf, D -> 1.0.

td is typically very small (microseconds); tf is typically orders
of magnitude higher (milliseconds).  The curve has a knee, and UNIX
I/O usually sits on the left side of it while most mainframe OS's
sit on the right.
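
For concreteness, here is a throwaway C sketch (tf = 10 ms and
td = 2 us per 512-byte block are assumed numbers, not measurements)
that tabulates D as n grows:

	#include <stdio.h>

	int
	main()
	{
		double tf = 10e-3;  /* fixed overhead per request: assumed 10 ms */
		double td = 2e-6;   /* cost per 512-byte block: assumed 2 us */
		long n;

		/* D = n*td / (tf + n*td), which approaches 1.0 as n -> inf */
		for (n = 1; n <= 1000000L; n *= 10)
			printf("n = %7ld   D = %.4f\n",
			    n, (n * td) / (tf + n * td));
		return 0;
	}

With these (made-up) numbers the knee sits near n = tf/td = 5000
blocks; an 8-block (4K) request leaves the channel more than 99%
idle, which is exactly the left-of-the-knee regime UNIX lives in.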

<For all the optimizations that these I/O processors are supposed to do,
<Unix rarely gives them the chance.  Unless there's more than two requests
<outstanding at once, once they finish one, there's only one request to
<choose from.  Unix has minimal readahead, so that's as many requests as
<a single process can generate.	Raw I/O is even worse.

Yep, Unix needs to do larger I/O transfers.  Case in point: the Cray-2
has a 16 Gbyte/sec I/O throughput capability with incredibly expensive
80+ Mbit/s parallel-head disks (often striped).   And yet, typing

	cp bigfile bigfile2

measures a transfer performance of only 18 Mbit/s, because BUFSIZ is 4K.
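
The cure is not subtle: do the copy in big chunks so that tf is
amortized over far more data.  A minimal sketch (the 4 Mbyte chunk
size is an arbitrary choice for illustration, and error handling is
skimpy):

	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define CHUNK (4L*1024*1024)	/* bytes per request: arbitrary, but >> 4K */

	int
	main(int argc, char **argv)
	{
		char *buf = malloc(CHUNK);
		ssize_t n;
		int in, out;

		if (argc != 3 || buf == NULL)
			return 1;
		in = open(argv[1], O_RDONLY);
		out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0666);
		if (in < 0 || out < 0)
			return 1;
		/* each read()/write() now pays tf once per CHUNK bytes */
		while ((n = read(in, buf, CHUNK)) > 0)
			if (write(out, buf, n) != n)
				return 1;
		return n < 0;
	}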

<Asynchronous reads would be the obvious way to get enough requests in
<the queue to optimize, but that seems unlikely to happen.  Rather,
<explicit read commands are giving way to memory-mapped files (in Mach
<and SunOS 4.0) where readahead becomes synonymous with prepaging.  It
<remains to be seen whether much attention is put into this.

There have been comments that SunOS 4.0 I/O overhead is 2 or 3 times
greater than under 3.0.  Demand-paged I/O introduces all of the Turing
divination problems of trying to predict which pages (I/O blocks) the
program will use next.  IMHO, this is a step backward.

<Barry credits the asynchronous nature of I/O on mainframe OS's to the
<access methods, like RMS on VMS.  People avoid those when they want
<speed (imagine using dbm to do sequential reads).  For instance, the
<VMS "copy" command bypasses RMS when copying disk-to-disk, with the
<curious result that it's faster to copy to a disk than to the null
<device, because the null device is record-oriented, requiring RMS.
 
RMS-style access methods developed through evolution ("survival of
the fastest?") to their current state of being I/O marvels.  Hence
the MVS preallocation requirements, VMS asynch channel I/O, etc.

<As DMR demonstrates, parallel-transfer disks are great for big files.
<They're horrendously expensive though, and it's hard enough to find
<controllers that keep up with even 3 MB/s, much less 10 MB/s.

Disk prices are dropping fast.  8" 1 Gbyte dual-head disks (6 MB/s)
will be common in about a year for $5000-$9000 qty 1.  The ANSI X3T9
IPI (Intelligent Peripheral Interface) is now a full standard.  It
starts at 10 MB/s and goes up to 25 MB/s in current configurations.
N.B. the vendors pushing this standard are IBM, CDC, Unisys, Fujitsu,
NEC, and Hitachi (big mainframe manufacturers).  Unix in its current
incarnation is unable to take advantage of this new disk technology.

<they can be simulated with ordinary disks by striping across multiple
<controllers, *if* the disks rotate as one.  Does anyone know of a cost-
<effective disk that can phase-lock its spindle motor to that of a second
<disk, or perhaps with the AC line?  With direct-drive electronically-
<controlled motors becoming common, this should be possible.  The Eagle
<has such a motor, but no provision for external sync.  I recall stories
<of Cray's using phase-locked disks to advantage.
 
The thesis of my paper "Turbo NFS" (*) is that you can get good
I/O performance without phase-locked disks by reorganizing the
file system contiguously.  Cylinders of data are prefetched from
selected disks at a rate commensurate with the rate at which the
data is consumed by the program.  Extents are allocated contiguously
by powers of 2.  The organization is called a "fractal file system".
Phillip Koch did the original work in this area (**).
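
To give the flavor of the power-of-2 extent allocation, here is a
toy buddy-system allocator for disk extents (my own illustration,
not code from either paper): free extents live on per-order free
lists, and a too-large extent is split into buddy halves until it
matches the request:

	#include <stdio.h>

	#define MAXORDER 10		/* up to 2^10 = 1024-block extents */

	struct extent {
		long start;		/* first disk block of the extent */
		struct extent *next;
	};

	static struct extent *freelist[MAXORDER + 1];
	static struct extent pool[2048];	/* toy storage for extent records */
	static int npool;

	static void
	put(int order, long start)	/* push a free extent of 2^order blocks */
	{
		struct extent *e = &pool[npool++];

		e->start = start;
		e->next = freelist[order];
		freelist[order] = e;
	}

	/* allocate a contiguous extent of 2^order blocks, splitting buddies */
	static long
	buddy_alloc(int order)
	{
		struct extent *e;
		int i;

		for (i = order; i <= MAXORDER && freelist[i] == NULL; i++)
			;
		if (i > MAXORDER)
			return -1;	/* no contiguous run big enough */
		e = freelist[i];
		freelist[i] = e->next;
		while (i > order) {	/* split: keep low half, free high half */
			i--;
			put(i, e->start + (1L << i));
		}
		return e->start;
	}

	int
	main()
	{
		put(MAXORDER, 0);	/* one 1024-block free extent at block 0 */
		printf("64-block extent at block %ld\n", buddy_alloc(6));
		printf("256-block extent at block %ld\n", buddy_alloc(8));
		return 0;
	}

Freeing is the mirror image: a released extent coalesces with its
buddy (when the buddy is also free) back into ever-larger contiguous
runs; that half is omitted here for brevity.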

<Of course, to get the most from high transfer rates, you need large
<blocksizes; DMR's example looked like about one revolution.  Hence
<the extent-based file allocation of mainframe OS's, etc.  Perhaps
<it's time to pester Berkeley to double MAXBSIZE to 16384 bytes?

Berkeley should start over.  The whole business with "cylinder
groups" tries to keep sets of blocks relatively near each other.  On
the new disks today, the average SEEK TIME IS OFTEN SHORTER THAN THE
ROTATIONAL DELAY.  You don't want to keep blocks "near" each other;
you want to make each extent as large as possible.  Sorry, but
cylinder groups are archaic.
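
A back-of-the-envelope comparison shows why (all figures assumed
for illustration: 3600 rpm, 10 ms average seek, 1 Mbyte/s media
rate).  Reading 1 Mbyte as 128 scattered 8K blocks versus one
contiguous extent:

	#include <stdio.h>

	int
	main()
	{
		double seek = 10e-3;	/* assumed average seek: 10 ms */
		double rot = 8.3e-3;	/* average rotational delay at 3600 rpm */
		double rate = 1e6;	/* assumed media rate: 1 Mbyte/s */
		double bytes = 1024.0 * 1024.0;

		/* scattered: every 8K block pays a seek plus rotational delay */
		printf("scattered:  %.2f s\n", 128 * (seek + rot) + bytes / rate);
		/* contiguous: pay the positioning cost exactly once */
		printf("contiguous: %.2f s\n", (seek + rot) + bytes / rate);
		return 0;
	}

With these numbers the scattered layout takes over three times as
long, and the ratio only gets worse as media rates climb.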

<The one point that nobody mentioned is that you don't want the CPU
<copying the data around between kernel and user address spaces when
<there's a lot!	(Maybe it was just too obvious).

Here is an area where paged I/O has an advantage.  The first UNIX vendor
to do contiguous file systems + paged I/O + prefetching will win big in
the disk I/O race.
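
As a sketch of what that looks like from user code (written against
the mmap() interface of SunOS 4.0/Mach-style VM; the call is not in
stock 4.3BSD), a mapped file is read by page fault, so the pager
moves file pages straight into the user address space with no
kernel-to-user copy:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int
	main(int argc, char **argv)
	{
		struct stat st;
		long i, sum = 0;
		char *p;
		int fd;

		if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
			return 1;
		if (fstat(fd, &st) < 0 || st.st_size == 0)
			return 1;
		/* pages are faulted (or prefetched) in, never copied by read() */
		p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		for (i = 0; i < st.st_size; i++)	/* touch every byte */
			sum += p[i];
		printf("checksum %ld\n", sum);
		return 0;
	}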
 
<Don Speck   speck at vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck

(*) "Turbo NFS: Fast Shared Access for Cray Disk Storage", A. Klietz
(MN Supercomputer Center)  Proceedings of the Cray User Group, Spring 1988.

(**) "Disk File Allocation Based on the Buddy System", P. D. L. Koch
(Dartmouth)  ACM TOCS, Vol 5, No 3, November 1987.

--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN  55415    UUCP:  alan at mn-at1.k.mn.org
Ph: +1 612 626 1836       ARPA:  alan at uc.msc.umn.edu  (was umn-rei-uc.arpa)

(*) An affiliate of the University of Minnesota


