Catching termination of child process and system() call

Thu Feb 14 18:21:11 AEST 1991

(This really belongs in a Unix newsgroup; however, I expect no further
followups, i.e., I think this will be the decisive answer.)

In various articles (see the references line) Doug Gwyn and Norman Diamond
argue over the type of the argument to wait(2).

In article <1356 at geovision.UUCP> pt at geovision.gvc.com writes:
>Sorry to add to this 'did not- did too' level of discussion, but a 
>"man 2 wait" on several machines shows [both].

Although I am a known BSDite (`BSD pervert' to some :-) ), I have to side
with Doug here.

The mess came about for historical reasons.  In the days of Version 6
Unix, there was only one wait() system call; it took a pointer to int.
V6 begat V7 and PWB; PWB grew (via a long and convoluted path) into
System V while V7 grew into 32V and eventually to 4BSD.  (There were
various cross-fertilizations along the way, but by and large the systems
split apart sometime between V6 and V7.)

As Doug has already noted, certain persons who shall remain nameless---
not to protect the guilty, but rather, simply, because I am not certain
who---changed both wait() and wait3() at about the same time as job
control (and wait3() itself) were added to the Berkeley kernel.
(Wait() and wait3() were in fact the same system call, distinguished
by, of all things, the condition codes in the VAX PSL.  The whole setup
was a botch.  Fortunately, all is now repaired.)  Since wait3() could
and did return more information than did wait()%, it seemed convenient
to make a union describing the different return values.  While all this
went on, no one changed the kernel: the union was carefully tailored
to match the actual kernel code, which still used `int's.
-----
% Ignore that masked ptrace() behind the curtain
-----

Because the kernel was unchanged, the fields in the union were byte
order dependent.  When 4.3BSD was ported to the Tahoe, a big-endian
machine, our industrious kernel hackers added byte-order macros and
made use of them in defining the wait union.  This made the same names
work on the two different machines.  Unfortunately, the resulting union
definition was still not right: the byte order of any given machine
does not uniquely determine the bit order of that machine.  With the
advent of POSIX our industrious kernel hackers finally gave up, sighed,
and replaced the union with accessor macros.
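
To make the layout problem concrete, here is a minimal sketch (mine, not
the Berkeley header).  The union name, the field names and the assumption
of a 32-bit int are all illustrative.  The C language leaves the order in
which bit-fields are allocated within a word up to the implementation, so
the union view may or may not agree with plain shifts and masks on the
int; the arithmetic view does not depend on any of that.

```c
#include <assert.h>

/* Hypothetical stand-in for the old status union; names are made up. */
union wait_sketch {
	int w_status;			/* what the kernel really stores */
	struct {
		unsigned int termsig : 7;	/* signal that killed the process */
		unsigned int coredump : 1;	/* core image was written */
		unsigned int retcode : 8;	/* exit code */
		unsigned int filler : 16;
	} w_T;
};

/* Portable view: plain arithmetic on the int. */
int termsig_of(int status)  { return status & 0177; }
int coredump_of(int status) { return (status & 0200) != 0; }
int retcode_of(int status)  { return (status >> 8) & 0377; }

/*
 * Returns 1 if this compiler's bit-field layout happens to agree with
 * the arithmetic view for the given status, 0 if it does not.
 */
int union_agrees(int status)
{
	union wait_sketch u;

	/* sketch assumes the bit-fields overlay the int exactly */
	assert(sizeof(union wait_sketch) == sizeof(int));
	u.w_status = status;
	return (int)u.w_T.termsig == termsig_of(status)
	    && (int)u.w_T.coredump == coredump_of(status)
	    && (int)u.w_T.retcode == retcode_of(status);
}
```

Calling union_agrees(3 << 8) on a given compiler shows whether the
bit-field view can be trusted there; the three *_of() functions give the
same answer everywhere.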

Meanwhile, on all those machines that still use the old Berkeley union,
it `just happens' (for the reasons given above) that `int's also work.
New machines that conform to POSIX standards will use `int's.  Therefore,
all new software should use `int's.  The new Berkeley <sys/wait.h> will
still work with old software as well (there is some hackery in the
accessor macros to accomplish this).
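
As a rough illustration of the `int' plus accessor macro style, the
following sketch uses the POSIX macros WIFEXITED, WEXITSTATUS,
WIFSIGNALED and WTERMSIG from <sys/wait.h>, together with waitpid().
The helper names run_child() and summarize(), and the -1000 return value,
are inventions for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that either kills itself with signal `sig' (if nonzero)
 * or exits with `code'.  Returns the raw int status, or -1 on failure.
 */
int run_child(int code, int sig)
{
	int status;
	pid_t pid, w;

	assert(code >= 0 && code <= 255);	/* exit codes are 8 bits */
	if ((pid = fork()) == -1)
		return -1;
	if (pid == 0) {
		if (sig != 0)
			kill(getpid(), sig);
		_exit(code);
	}
	/* retry if a signal handler interrupted the wait */
	while ((w = waitpid(pid, &status, 0)) == -1 && errno == EINTR)
		continue;
	return w == pid ? status : -1;
}

/* Exit code if the child exited, -signal if it was killed, else -1000. */
int summarize(int status)
{
	if (WIFEXITED(status))
		return WEXITSTATUS(status);
	if (WIFSIGNALED(status))
		return -WTERMSIG(status);
	return -1000;	/* stopped, or not a termination status */
}
```

With these, summarize(run_child(3, 0)) yields the child's exit code, and
a child killed by a signal shows up as a negative number.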

The answer, then, is that to wait for a process whose id is `pid' you
should use:

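	int w, status;

	/* assumes check_other_wait_results() returns nonzero once it has
	   filled in status for a pid that was already collected */
	if (!check_other_wait_results(pid, &status))	/* if necessary */
		while ((w = wait(&status)) != pid) {
			if (w == -1) {
				if (errno == EINTR)	/* ugly but sometimes... */
					continue;	/* ...necessary */
				break;	/* real failure (e.g. ECHILD); status invalid */
			}
			record_other_wait_result(w, status);	/* if necessary */
		}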

The exit status of the process, if it exited, is then
`(status >> 8) & 0377', and the signal, if any, that caused the process
to die is `status & 0177' (zero if the process exited on its own).
The process left a core dump (`image' or `traceback data' to non-Unix
folks) if `status & 0200' is nonzero.  This *will work* on systems
that currently have the union.  It will draw warnings from lint, but
then, lint does not know *every*thing.
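
For completeness, here is one way the two helper routines named in the
loop might look.  This is only a sketch under assumptions of my own: a
small fixed-size table (MAXSAVED is arbitrary), a
check_other_wait_results() that returns 1 when it finds and hands back a
saved status, and the extra names wait_for() and spawn_exiting(), which
exist only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAXSAVED 32	/* arbitrary table size for this sketch */

static struct saved { pid_t pid; int status; } saved[MAXSAVED];
static int nsaved;

/* Remember the status of a child we were not waiting for. */
void record_other_wait_result(pid_t pid, int status)
{
	assert(nsaved < MAXSAVED);	/* sketch: no growth, just complain */
	saved[nsaved].pid = pid;
	saved[nsaved].status = status;
	nsaved++;
}

/* If pid was already collected, hand back its status and return 1. */
int check_other_wait_results(pid_t pid, int *status)
{
	int i;

	for (i = 0; i < nsaved; i++) {
		if (saved[i].pid == pid) {
			*status = saved[i].status;
			saved[i] = saved[--nsaved];	/* drop the entry */
			return 1;
		}
	}
	return 0;
}

/* Wait for one particular child; 0 on success, -1 if wait() failed. */
int wait_for(pid_t pid, int *status)
{
	pid_t w;

	if (check_other_wait_results(pid, status))
		return 0;
	while ((w = wait(status)) != pid) {
		if (w == -1) {
			if (errno == EINTR)
				continue;
			return -1;	/* e.g. ECHILD: no such child */
		}
		record_other_wait_result(w, *status);
	}
	return 0;
}

/* Start a child that exits immediately with the given code. */
pid_t spawn_exiting(int code)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(code);
	return pid;		/* child's pid, or -1 if fork() failed */
}
```

Starting two children and calling wait_for() on the second one first
exercises both paths: the first child's status is either collected
directly later or parked in the table and picked up from there.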
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek at ee.lbl.gov