UNIX semantics do permit full support for asynchronous I/O

A. Lester Buck buck at siswat.UUCP
Sun Sep 2 17:48:48 AEST 1990


In article <1990Aug30.222226.20866 at cbnewsm.att.com> lfd at cbnewsm.att.com (leland.f.derbenwick) writes:
>In essentially any serious database application, a completed
>write() to a raw disk is treated as a guarantee that the data
>block has been _physically written to the device_.  (This is
>needed to ensure reliable transaction behavior in the presence
>of potential system crashes.)  Since your suggestion would void
>that guarantee, it is not benign.

Close, but not quite.  The guarantee is that the _controller_ has accepted
the data.  Whether and when the bits actually hit the media is not fully under
the control of the OS.  Remember SCSI has a READ BUFFERED DATA command for error
recovery.  SCSI disks are coming with bigger caches all the time, and a
power hit can take out a significant amount of data.  If the database really
must remain consistent, a UPS is probably required.

As to Steve's idea, it has a certain elegance to recommend it.  But its
practical value is low.  Sure, it can be made to have full Unix semantics,
but at the price of the common case reducing almost exactly to synchronous
I/O.  Or imagine the case of an I/O server process sharing memory
with dozens of clients.  Each shared memory segment will have to keep a
list of every process that must block on a page fault. The practical effect
will be that an _arbitrary_ number of processes will potentially block
for every I/O, instead of doing useful work in their own address spaces.
This scheme falls into the general class of YANSUAIOM (Yet Another
Non-Standard Unix Asynchronous I/O Mechanism), as do the schemes with
ioctl's or select'ing on disk.

What may be difficult to understand at this point, when Unix has not had a
standard asynchronous I/O facility, is that we will program _differently_
when it is widely available.  The semantics of I/O must change (broaden).
The structure and flow of a program will be significantly different when it
uses asynchronous I/O, in the same way that the availability of real
threads leads to new programming paradigms to take advantage of those
facilities.  We may have to look at schemes used in the realtime Unix
versions, VMS (gag) or even MVS (gag!!), which have had asynchronous I/O
facilities for years, in some cases decades, to adapt to this new mindset.

The only reason one designs an asynchronous I/O facility is to efficiently
overlap computation with I/O transfers, and that can take some careful
thought to achieve maximum speedup.  For example, Chris Torek recently
traced the path of a raw synchronous I/O, which eventually sleeps in
physio() in the context of the calling process.  A large transfer will loop
through physio, with a wakeup/sleep cycle for every chunk (limited by how
much physical memory the OS wants to lock down at once).  Each sleep/wakeup
cycle is an expensive context switch, involving reloading the virtual memory
state of the caller.  But a fully asynchronous I/O scheme drags along enough
state to start the next I/O chunk all within the driver interrupt routine,
with the calling process completely out of context.  Of course, it is a
bit(!) more complicated if non-resident pages are found in the next chunk
that needs to be page-locked...
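
The chunk-restart idea above can be sketched in user space.  This is only a
toy model, not kernel code: struct xfer, start_chunk(), chunk_done() and
run_transfer() are hypothetical names invented for the illustration, and the
"interrupt" is simulated by an ordinary function call.  The point is that the
per-transfer state travels with the request, so the completion path can start
the next chunk itself and the caller is notified exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-transfer state that an async scheme "drags along". */
struct xfer {
    size_t total;      /* bytes requested by the caller */
    size_t done;       /* bytes completed so far */
    size_t chunk_max;  /* most memory the OS will lock down at once */
    size_t chunks;     /* device operations issued so far */
    int    finished;   /* set once, when the whole transfer completes */
};

/* Stand-in for programming the controller: only sizes the next chunk. */
static size_t start_chunk(struct xfer *x)
{
    size_t left = x->total - x->done;
    size_t n = left < x->chunk_max ? left : x->chunk_max;
    x->chunks++;
    return n;
}

/* Called from the (simulated) completion interrupt for a chunk of n bytes. */
static size_t chunk_done(struct xfer *x, size_t n)
{
    x->done += n;
    if (x->done < x->total)
        return start_chunk(x);  /* restart without waking the caller */
    x->finished = 1;            /* the single notification point */
    return 0;
}

/* Drive the simulation; returns the number of chunks issued. */
static size_t run_transfer(size_t total, size_t chunk_max)
{
    assert(chunk_max > 0);      /* a zero-sized chunk would never finish */
    struct xfer x = { total, 0, chunk_max, 0, 0 };
    size_t n = start_chunk(&x);
    while (!x.finished)
        n = chunk_done(&x, n);
    return x.chunks;
}
```

In this model a 10-byte transfer with a 4-byte lock-down limit issues three
chunks but sets finished only once, whereas the synchronous physio() loop
described above would sleep and wake the caller once per chunk.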

The POSIX.4 asynchronous I/O facilities are moving toward final ballot and
present a rich set of asynchronous I/O primitives.  These include the
obvious aread/awrite, and listio, similar to readv/writev for synchronous
transfers, which can fire off a large number of aio's at once and optionally
be notified only when they are all complete.  Iosuspend is a more advanced
version of select that waits for completion of any operations in a list.
The process can learn of I/O completion in at least four ways:  1) return
codes written into the process' asynchronous I/O control block, 2) receiving
a completely asynchronous "fixed" (queued, tagged) signal/event which runs a
handler, 3) synchronously suspending for I/O completion (iosuspend), or 4)
synchronously suspending or polling for the signal/event posting I/O
completion.  [Suspending is familiar, but the committee added polling, where
a process can sleep until one of a selected signal/event class is posted
while taking signal/events not being polled for completely asynchronously.]
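
As a rough illustration of completion methods 1) and 3), here is a minimal
sketch using the names under which these draft primitives were later
standardized: aio_read, aio_suspend, aio_error and aio_return (listio became
lio_listio).  The helper read_async() is hypothetical, error handling is
kept to a minimum, and some systems need the program linked with -lrt.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes from offset 0, asynchronously.
 * Returns the byte count, or -1 with errno set. */
static ssize_t read_async(const char *path, void *buf, size_t len)
{
    assert(buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct aiocb cb;                 /* the asynchronous I/O control block */
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal; we suspend */

    if (aio_read(&cb) != 0) {        /* queue the request and return at once */
        close(fd);
        return -1;
    }

    /* ... useful computation could overlap with the transfer here ... */

    /* Method 3: suspend until an operation in the list completes.
     * aio_suspend may return early (e.g. EINTR), so re-check the status. */
    const struct aiocb *const list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        (void)aio_suspend(list, 1, NULL);

    /* Method 1: the result is read back through the control block. */
    int err = aio_error(&cb);
    ssize_t n = aio_return(&cb);
    close(fd);
    if (err != 0) {
        errno = err;
        return -1;
    }
    return n;
}
```

A caller would invoke read_async("somefile", buf, sizeof buf) and get the
byte count back.  A real program would do its own work between aio_read and
aio_suspend rather than suspending immediately, since that overlap is the
whole reason for the facility.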

-- 
A. Lester Buck    buck at siswat.lonestar.org  ...!uhnix1!lobster!siswat!buck