Children's exit() status

Tue Feb 23 03:13:02 AEST 1988

  Okay UNIX Sys V hackers, here's a question for you.
  In the following scenario, how should a parent process
  wait for its children to complete:

REQUIREMENT:

  I have a parent process that forks 30 identical children.
  The children conduct some measurements, and when done,
  each sends a single IPC message with results back to the
  parent and exits.

  The children are identical, so they should all have roughly
  equal life spans, though that time may vary between 5 and 15 minutes.

  The parent needs to be woken when the first child exits --
  a straightforward wait().  The parent must also know if
  any children complete in error.

  It is preferable that the parent check the children's exit status
  for any errors, since the system may indicate strange situations in
  the exit status, and the children are already designed to use exit(code).
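
  For reference, a minimal sketch of pulling apart the status word
  that wait() hands back.  It assumes the traditional layout described
  in wait(2): terminating signal in the low 7 bits, core-dump flag in
  bit 0200, and the argument to exit() in the next 8 bits.
  decode_status() is a hypothetical name, not a library routine.

```c
#include <assert.h>
#include <stddef.h>

/* Split a raw wait() status word, assuming the traditional layout:
 *   low 7 bits  = terminating signal (0 if the child called exit),
 *   bit 0200    = set if a core image was produced,
 *   next 8 bits = argument to exit(), meaningful when the signal is 0. */
void decode_status(int status, int *exit_code, int *sig, int *core)
{
    assert(exit_code != NULL && sig != NULL && core != NULL);
    *sig = status & 0177;
    *core = (status & 0200) != 0;
    *exit_code = (status >> 8) & 0377;
}
```

  A child that called exit(7) would decode to exit_code 7 with sig 0;
  a child killed by signal 9 would decode to sig 9, which is the kind
  of "strange situation" a message from the child could never report.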

POSSIBLE SOLUTIONS:

  Here's what I've thought of so far:

  There seem to be two types of solutions: either use wait(), with or
  without SIGCLD, or use blocking message receives.

  I'd like to use wait(), because the children have a meaningful
  exit status.  The question is, is it possible for my program
  to be woken up only 20 times for 30 children?  I.e., could I miss
  child deaths because several occur "simultaneously"?  (Simultaneously
  meaning that while I'm awake checking one child's return code, another
  2 children die -- the next wait() missing one or both of them.)
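
  A sketch of the plain wait() loop in question, offered as a way to
  test the worry rather than as a definitive answer.  It assumes SIGCLD
  is left at its default, in which case a dead child is kept as a
  zombie until some wait() collects it, so each wait() should return
  exactly one child.  spawn_children() and reap_children() are
  hypothetical names, and the instant exit(i % 3) only stands in for
  the real measurement work.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the measurement children: child i exits
 * at once with code (i % 3), so many deaths land close together.
 * Returns the number of children actually started. */
int spawn_children(int n)
{
    int i;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return i;        /* fork failed: report how many started */
        if (pid == 0)
            _exit(i % 3);    /* child: no measurements in this sketch */
    }
    return n;
}

/* Call wait() until no children remain.  Returns the number of
 * children reaped; *bad receives how many had a non-zero status
 * (non-zero exit code or death by signal). */
int reap_children(int *bad)
{
    int status;
    int reaped = 0;

    assert(bad != NULL);
    *bad = 0;
    for (;;) {
        pid_t pid = wait(&status);
        if (pid < 0) {
            if (errno == EINTR)
                continue;    /* interrupted by a signal: try again */
            break;           /* ECHILD: nothing left to wait for */
        }
        reaped++;
        if (status != 0)
            (*bad)++;
    }
    return reaped;
}
```

  Calling spawn_children(30) and then reap_children(&bad) makes a
  quick experiment: if deaths could be lost, the return value would
  come back smaller than 30.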

  If I *do* miss child deaths, then upon each wake-up from wait()
  I could kill(pid, 0) each of the children to see if they're all dead.
  I wouldn't miss any deaths that way, but I'd still miss some exit codes.
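
  A minimal sketch of the kill(pid, 0) probe, assuming a kill(2) that
  accepts signal 0 as an existence check.  One caveat worth checking
  on the target system: on POSIX-style systems a child that has exited
  but has not yet been collected by wait() still owns its process ID,
  so the probe keeps reporting it as present.  child_exists() and
  probe_around_wait() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if pid still names a process (running or not yet waited
 * for), 0 if no such process exists. */
int child_exists(pid_t pid)
{
    assert(pid > 0);
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;   /* exists, but we may not signal it */
}

/* Fork one child that exits at once, then probe it before and after
 * wait() collects it.  Returns 0 on success, -1 on failure. */
int probe_around_wait(int *before, int *after)
{
    int status;
    pid_t pid;

    assert(before != NULL && after != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(3);

    *before = child_exists(pid);     /* running or zombie: still there */
    for (;;) {
        pid_t r = wait(&status);
        if (r == pid)
            break;
        if (r < 0 && errno != EINTR)
            return -1;
    }
    *after = child_exists(pid);      /* collected: process ID is gone */
    return 0;
}
```

  Under that assumption the probe only reports a child as gone after
  its status has already been collected, so it cannot by itself detect
  a death that wait() has not yet reported.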

  If I'm going to miss exit codes, I could use signal(SIGCLD, SIG_IGN)
  after the first child's death to wait() for the last child's death.
  Then I'd check to see if I have 30 messages waiting.  There are
  warnings about using this signal in signal(2), so this is no good.

  Another possibility is to have the children send a software signal
  to the parent just before they die.  I wouldn't miss any deaths,
  but this is no help with exit codes.

  Another solution is to use vanilla blocking message receives.
  I know how many children I have, and could expect that number
  of messages.  I'd have to change the children to not send a message
  if they encountered a problem -- the message in effect acting as
  a "normal" return code.  However, error codes from built-in exit()s
  would be lost, unless redesigned to send the code in a message before
  exiting.  I'd also lose any system information encoded in the exit code.
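
  A sketch of the message-only variant, in which each child puts its
  intended exit code into the message before exiting.  It assumes
  System V message queues (msgget, msgsnd, msgrcv, msgctl); struct
  result_msg, RESULT_SIZE and collect_by_message() are hypothetical,
  and the code (i % 3) stands in for real results.  Note the weakness
  described above: a child that dies before sending leaves the parent
  blocked in msgrcv().

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical message layout: the child's index plus the code it is
 * about to pass to exit(). */
struct result_msg {
    long mtype;   /* required first member, must be greater than 0 */
    int  child;
    int  code;
};

#define RESULT_SIZE (sizeof(struct result_msg) - sizeof(long))

/* Fork n children; each sends one message carrying (index % 3) and
 * exits with the same value.  The parent blocks in msgrcv() once per
 * started child and records the reported codes in codes[].
 * Returns the number of messages received, or -1 if no queue could
 * be created. */
int collect_by_message(int n, int *codes)
{
    struct result_msg m;
    int i, qid;
    int started = 0, got = 0;

    assert(n > 0 && codes != NULL);
    qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid < 0)
        return -1;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            m.mtype = 1;
            m.child = i;
            m.code = i % 3;
            msgsnd(qid, &m, RESULT_SIZE, 0);
            _exit(m.code);
        }
        started++;
    }

    while (got < started) {
        if (msgrcv(qid, &m, RESULT_SIZE, 0, 0) < 0)
            break;                   /* give up on error or interrupt */
        codes[m.child] = m.code;
        got++;
    }

    while (wait(NULL) > 0)           /* collect the children afterwards */
        ;
    msgctl(qid, IPC_RMID, NULL);     /* remove the queue */
    return got;
}
```

  With this layout the parent learns only what the child chose to
  report; a death by signal never produces a message at all.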

Has anybody out there run into this type of situation?
Any facts, clues or pointers appreciated.  If you reply,
please cc: email since I don't often read news.  Thanks.

Len Brown
201-949-0092
{ ihnp4 etc. }!houxs!lenb