Is System V.4 fork reliable?

Fri Jul 27 09:47:25 AEST 1990

In <13426 at cbmvax.commodore.com> ag at cbmvax.commodore.com (Keith Gabryelski) writes:
>>In article <561 at oglvee.UUCP> jr at oglvee.UUCP (Jim Rosenberg) writes:
>>>Somewhere along in the development of System V, fork became an
>>>unreliable system call....
>Unreliable in what way?  fork() has always been documented to fail
>if there isn't enough system resources.

*** OPEN DRAGON MOUTH ***

What do you count as a resource?  Disk space is a resource.  If I run out of
swap space then, as system administrator, this is *my* problem.  I don't go
complaining to Dennis Ritchie because I don't have a rubber disk.  I also *can
fix it*.  It isn't fun, but I can reconfigure my disk for more swap space.

Is physical memory a resource?  Now it gets sticky.  The whole concept of
virtual memory is exactly supposed to be that I SHOULDN'T HAVE TO think of
physical memory as a resource; the operating system should handle that, as
long as what I'm asking for is reasonable.  If my system is so loaded that
there are no more process table slots, again, as system administrator that's
*my* problem.  And I can fix it.

But if system calls fail simply because of a very temporary bout of activity,
that is *not my problem*!  It's the kernel's problem.  At least it should be:
that's what "I'm paying the kernel for" so to speak.  And *I CAN'T FIX IT*
myself.  If utilities haven't been rewritten to do the right thing with EAGAIN
after fork(), and I'm only a binary licensee, what can I do about it?  (Except
climb the soap box in this newsgroup, of course!  :-))

>When assigning a process id the kernel will try to allocate a proc_t
>without sleeping [ie, pass KMEM_NOSLEEP to *alloc()].  If this fails,
>fork() will return EAGAIN.

OK, so we have the word:  V.4 will have some of the same problems as V.3.
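
To picture what the quoted description implies, here is a user-space toy
model.  It is *not* kernel code, and every name in it (toy_fork, toy_exit,
slot_alloc_nosleep, NSLOTS) is made up for illustration.  The only point is
the shape of the policy:  the allocator never waits, so a momentarily full
table turns straight into EAGAIN for the caller.

```c
#include <errno.h>
#include <stddef.h>

#define NSLOTS 2                 /* tiny on purpose, so the table fills fast */

struct slot { int in_use; };
static struct slot table[NSLOTS];

/* Hypothetical stand-in for a no-sleep allocation: it never waits. */
static struct slot *slot_alloc_nosleep(void)
{
    size_t i;

    for (i = 0; i < NSLOTS; i++) {
        if (!table[i].in_use) {
            table[i].in_use = 1;
            return &table[i];
        }
    }
    return NULL;                 /* nothing free right now: give up at once */
}

/* Toy "fork": returns a slot index, or -1 with errno set to EAGAIN. */
int toy_fork(void)
{
    struct slot *s = slot_alloc_nosleep();

    if (s == NULL) {
        errno = EAGAIN;          /* the caller is told to try again later */
        return -1;
    }
    return (int)(s - table);
}

/* Toy "exit": frees the slot so that a later toy_fork() can succeed. */
void toy_exit(int idx)
{
    table[idx].in_use = 0;
}
```

With two slots, a third toy_fork() fails with EAGAIN and succeeds again as
soon as one toy_exit() has run.  A sleeping variant would block inside the
allocator until that happened, instead of handing the failure to the caller.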

>What isn't reasonable about this?  fork() is documented (under SVR4)
>to fail returning EAGAIN if:

	[citation from the man page for fork(2)]

Come on now.  How many times have you seen a writeup on how to do things with
fork() in a UNIX magazine or book -- and how many of those times has the
author done *ANYTHING* at all with EAGAIN??  It is simply a fact that the need
to retry forks that have failed with EAGAIN is not widely embedded out there
in UNIX culture.  I have not seen one single writeup in print that really
discusses this.  My favorite text on programming UNIX system calls is Marc
Rochkind's book, Advanced UNIX Programming.  Maybe I have an old edition, but
here's what he says about it:  "The only cause of an error is resource
exhaustion, such as insufficient swap space or too many processes already in
execution."  No mention of the fact that if you just wait for the page-
stealing daemon to go to work everything will be fine.  Is it "reasonable" for
my application to have to sleep when in fact it should be the kernel that does
the sleep for me?  Why should I have to guess how long to sleep?  My
application doesn't have to sleep to wait for a disk block to become available
-- the kernel does this for me.  Why shouldn't it do the same thing for memory
pages?

What retry policy should one use?  How long should one sleep?  How many
retries?  Or even a larger question:  How come there's no consensus on this?
How come this isn't one of those FrequentlyAskedQuestions?

Does every V.4 utility that forks in fact *do something sensible* with EAGAIN?
(Since having the application deal with this is so "reasonable" ...)  Once
again, those of you out there with source can answer this one easily enough.

The larger question is *why can't the kernel sleep* when it needs more memory
for a fork???  It appears there is a risk of some kind of deadlock.  Whose
problem is *this*?  Mine as system administrator?  When I brought this subject
up the first time, someone posted that they had, in fact, hacked their source
code to allow sleep under V.3.  This person said he got away with it.  That
leads me to believe that he was simply lucky, and that the race condition or
deadlock (or whatever the problem is) is mighty obscure.  Perhaps too obscure
for anyone to *FIX*.

My personal view is that a kernel whose only mutual exclusion mechanisms are
sleep-wakeup and spl() just makes it too complicated to really fix this and
allow sleep.  Once upon a time there was a consensus that the kernel needed to
be rewritten from scratch.  Once upon a time there was the famous V.5 skunk
works doing exactly this, and based on what Bill Joy told me once it sounds
like it would have dealt with this kind of problem.  But then the skunk works
became politically untenable after the OSF rebellion, so now we seem not to
hear much talk about how important it is to rewrite the kernel.  (Except from
OSF and CMU, who say they've already done it ... :-))

We only hear talk that it is "reasonable" to bother the application with the
fact that the system just happens to be kind of busy at the moment, but busy
with a problem on which it *should* be able to sleep ...

And for those of you wizards out there who write articles on C for such mags
as UNIX Review & UNIX World, please **TELL PEOPLE** about this issue!
-- 
Jim Rosenberg             #include <disclaimer.h>      --cgh!amanue!oglvee!jr
Oglevee Computer Systems                                        /      /
151 Oglevee Lane, Connellsville, PA 15425                    pitt!  ditka!
INTERNET:  cgh!amanue!oglvee!jr at dsi.com                      /      /