UNIX question

Thu Dec 12 09:49:00 AEST 1985

> > My question: Is there any way to kill off these zombies so I can get
> >          more processes ?  Or, failing that, is there any other
> >          way to do what I want ?
>
>               ...
>
> Or, you could keep track of the child PIDs and probe
> their state every so often via kill() with "signal" 0,
> waiting on those that return failure from the kill().

This will work on 4.x-based systems, but not on most others.  Kill does
not support "signal" 0 in many earlier systems.  In System III and V,
kill does support "signal" 0, but does not fail on attempts to send
signals to zombies.
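To make the difference concrete, here is a minimal sketch of the probe on a current POSIX system. The helper name child_has_vanished is hypothetical. Linux, for one, sides with System III and V on this point: kill() with signal 0 succeeds on a zombie, so the probe cannot tell an exited but unreaped child from a running one.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: probe a child with "signal" 0.
 * Returns 1 if kill() says the process no longer exists (ESRCH),
 * 0 if it still exists (or exists but we lack permission, EPERM).
 * Where a zombie counts as an existing process, as on System III/V
 * and on Linux, this keeps returning 0 until the child is reaped. */
int child_has_vanished(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 0;
    return errno == ESRCH;
}
```

On Linux, after waitid(P_PID, pid, &info, WEXITED | WNOWAIT) has confirmed that the child exited (without reaping it), child_has_vanished(pid) still returns 0; only after waitpid() collects the child does it return 1.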

> A clean way to handle this problem on Sys3 was to use the following
> system call in the parent process:
>       signal(SIGCLD, SIG_IGN);
>
> Then when a child process exited, a zombie would not be created.
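A minimal sketch of the same idea with current names, assuming a system that follows POSIX.1-2001 here: SIGCLD is the System V spelling of what POSIX calls SIGCHLD, and while it is ignored, exiting children do not become zombies. wait() then blocks until all children are gone and fails with ECHILD. Both function names are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Modern spelling of the Sys III call signal(SIGCLD, SIG_IGN):
 * with SIGCHLD ignored, exiting children are discarded instead of
 * being kept as zombies.  Returns 0 on success, -1 on error. */
int discard_exited_children(void)
{
    return signal(SIGCHLD, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Hypothetical demo: fork n children that exit at once, then count
 * how many exit statuses wait() can still collect.
 * Returns -1 if fork() fails. */
int fork_and_count_statuses(int n)
{
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);
    }
    int collected = 0;
    while (wait(NULL) > 0)   /* ends at -1 with ECHILD when none are left */
        collected++;
    return collected;
}
```

After discard_exited_children(), fork_and_count_statuses(3) returns 0 with errno set to ECHILD: the children left nothing behind to collect.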

This applies to System V as well.  It is not, however, part of the SVID (System V Interface Definition).

> Is SIGCLD always reset to SIG_DFL on exec?  If not, since ignored
> signals normally remain ignored, it could break other programs
> which expect to collect children; and programs that ignore SIGCLD
> would have to carefully un-ignore it just after forks.

SIGCLD is not reset from SIG_IGN to SIG_DFL on exec.  Yes, this
means that programs which ignore it need to be careful before spawning
other programs.  The same is true, by the way, of programs which mask
out signals in BSD systems.
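A sketch of the precaution, assuming POSIX fork() and execv(): the child restores SIGCHLD to SIG_DFL and unblocks it before exec. For SIGCHLD specifically, POSIX leaves it unspecified whether an ignored disposition survives exec; Linux keeps it ignored. A blocked signal mask is always inherited. The names reset_child_signal_state and spawn_program are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Undo signal state that exec would otherwise hand on to the new
 * program: an ignored SIGCHLD can stay ignored across exec, and a
 * blocked SIGCHLD stays blocked.  Returns 0 on success, -1 on error. */
int reset_child_signal_state(void)
{
    sigset_t set;

    if (signal(SIGCHLD, SIG_DFL) == SIG_ERR)
        return -1;
    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    return sigprocmask(SIG_UNBLOCK, &set, NULL);
}

/* Hypothetical spawn helper: fork, clean up in the child, then exec.
 * Returns the child's PID, or -1 if fork() fails. */
pid_t spawn_program(const char *path, char *const argv[])
{
    pid_t pid = fork();

    if (pid == 0) {
        if (reset_child_signal_state() == 0)
            execv(path, argv);
        _exit(127);   /* reached only if the reset or execv() failed */
    }
    return pid;
}
```

The parent's own settings are untouched; only the child, just after the fork, goes back to the defaults that a program started normally would expect.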

>                         In V7, 3BSD, and 4BSD, and I suspect also
> in Sys III and V (and Vr2 and Vr2V2), and probably in V8 as well,
> signals are not queued, and without the `jobs library' of 4.1BSD,
> or the signal facilities of 4.2, this code cannot be made to operate
> reliably.  It *will fail*, someday, no doubt at the worst possible
> moment.
>
> The problem is that several children may exit in quick succession.
> Only one SIGCLD signal will be delivered, since the parent process
> will (just this once) not manage to run before all have exited.
> The sigcld handler has no way of determining how many children are
> to be processed.

It turns out that SIGCLD can be used reliably in System III and V.
What is missing from the example is a call within the signal handler
to re-install itself.

>       int
>       sigcld()
>       {
>               int pid, status;
>               pid = wait(&status);
>               ...
>>>             signal(SIGCLD, sigcld);         /* add this line */
>       }

The signal(2) system call checks whether any zombie children are
present and, if so, sends the calling process another SIGCLD.
The signal handler is thus invoked recursively, once per zombie.
Note that the reinstallation of the handler must follow the call to
wait, or infinite recursion results.
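Filled out as a self-contained sketch, the System V idiom looks like this. The counter children_reaped is a hypothetical stand-in for whatever the original handler does after wait(). The reliability described above depends on System III/V re-sending SIGCLD from signal(); on systems that do not (Linux, for example), this handler collects only one child per delivered signal and so keeps the weakness quoted earlier.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t children_reaped = 0;   /* hypothetical */

/* Collect one child, then re-install the handler.  On System III/V
 * the signal() call sends another SIGCLD if more zombies remain, so
 * the handler runs once per zombie.  Re-installing before wait()
 * would find the same zombie again and recurse without end.
 * Systems without that re-send behaviour (e.g. Linux) get only one
 * child collected per signal actually delivered. */
static void sigcld(int sig)
{
    int saved_errno = errno;   /* wait() may overwrite errno */
    int status;

    (void)sig;
    if (wait(&status) > 0)
        children_reaped++;
    signal(SIGCHLD, sigcld);   /* must come after wait() */
    errno = saved_errno;
}
```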

Unfortunately, in System III SIGCLD was not reset when caught, so this
call could be left out, and children could then be missed.  This
was changed in System V; SIGCLD is reset to SIG_DFL when caught.  Note
that there is no loss of reliability from the reset to SIG_DFL; since
SIGCLD is ignored by default, this is equivalent to masking out the
signal until the handler is reinstalled.  Unfortunately, both System III
and V fail to document these semantics of signal(2), and instead have
an incorrect explanation on the signal(2) page which states that SIGCLD
signals are queued internally.  We at HP implemented some systems (HP9000
series 500 releases <= 4.02) which queued the signals as AT&T documents;
current HP systems are all compatible with the System V code.

BTW, I find BSD's wait3 with WNOHANG to be a more intuitive mechanism.
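For comparison, a sketch of that approach using waitpid() with WNOHANG, which is how POSIX standardized the same non-blocking wait (BSD's wait3(&status, WNOHANG, NULL) behaves similarly where it is available). The handler loops until nothing is left to collect, so one signal covering several exited children loses none of them. The names reaped_count, on_sigchld and install_sigchld_reaper are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t reaped_count = 0;

/* One SIGCHLD may stand for several exited children, so drain them:
 * waitpid() with WNOHANG returns 0 once no more children have exited
 * and -1 (ECHILD) once there are no children at all. */
static void on_sigchld(int sig)
{
    int saved_errno = errno;   /* waitpid() may overwrite errno */
    int status;

    (void)sig;
    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped_count++;
    errno = saved_errno;
}

/* Install the handler once; a sigaction() handler stays installed
 * after each delivery unless SA_RESETHAND is set. */
int install_sigchld_reaper(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;   /* no SIGCHLD for stopped children */
    return sigaction(SIGCHLD, &sa, NULL);
}
```

Because the handler is never reset on delivery, there is no re-installation step and no ordering rule to get wrong.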

                        Bob Lenk
                        {hplabs, ihnp4}!hpfcla!rml