SCO Xenix System Hang

Gordon Burditt gordon at sneaky.TANDY.COM
Mon Dec 19 14:33:17 AEST 1988


The described "hang" (the system runs very slowly, but having different
users log in under different accounts fixes or reduces the problem) is
caused by the per-user-id process limit.

There are several features that contribute to this:

1.  There is a limit to the number of processes a non-root user may
    have running at one time, called MAXUPRC.  The default is probably
    something like 15.  If you can re-link the kernel, you can probably 
    raise this limit.  (I am working from an old version (non-*86) of
    SCO Xenix System III, so some of this may have changed).  Raising this
    limit does not increase the size of any tables.  If you need lots of
    processes (and this problem will exist regardless of what uid's they
    run under), you may need to increase the number of entries for processes,
    open files, and inodes.  If you run out of open files or inodes, you
    get cryptic console messages like "no file".  If you run out of
    system processes, you may get error messages (csh), or just lots of
    retrying (sh).

    Fix:  don't run everything under the same uid, and/or raise MAXUPRC.
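
    If you want to see the limit from a program rather than by logging in
    repeatedly, the sketch below (in present-day C, not the compiler of that
    era) forks until fork() fails and reports how far it got; that count plus
    the processes you already have running roughly equals MAXUPRC.  It is an
    illustration only: it deliberately fills your process quota for about
    half a minute, so run it from a login you can afford to tie up.

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/types.h>
        #include <sys/wait.h>

        int main(void)
        {
            int n = 0;
            pid_t pid;

            for (;;) {
                pid = fork();
                if (pid == -1) {
                    perror("fork");   /* typically EAGAIN: per-uid limit hit */
                    break;
                }
                if (pid == 0) {
                    sleep(30);        /* child just occupies a process slot */
                    _exit(0);
                }
                n++;
            }
            printf("created %d children before fork() failed\n", n);

            while (wait((int *)0) > 0)    /* reap the children as they exit */
                ;
            return 0;
        }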

2.  If the "sh" shell gets an error on a "fork" due to running over the
    MAXUPRC limit, it retries.  Forever, unless interrupted by a signal.
    For a quick test of this, log in, then type sh<RETURN> repeatedly.
    After about 15 or 20 times, you won't get another prompt.  Use your
    interrupt character to unlock the terminal.  Then type lots of
    control-D's to get rid of all those shells.  

    Now, imagine three users logged in under the same user id.  Each
    has 5 processes, and is trying to create a 6th with sh.  None of them 
    will get any work done until one of them aborts the retries and
    terminates that shell.  Whether or not the interrupt character can
    do this, and whether trying will destroy data, depends on the application.

    The scenario this retrying is supposed to handle is waiting for another,
    independent job started from the same terminal (say, a mailer doing
    background delivery) to release its process slots, so that the fork
    eventually succeeds without any further action from the user.

    Fix:  Try to arrange jobs to not require deep nesting of processes.
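
    The retrying sh does amounts to a loop like the one below, minus the
    give-up test; the bounded variant sketched here is what an application
    could do instead, so it fails cleanly rather than hanging the terminal.
    This is an illustration of the idea, not sh's actual source; the retry
    count and delay are left to the caller.

        #include <stdio.h>
        #include <unistd.h>
        #include <sys/types.h>

        /* Bounded alternative to retrying forever: give up (and report)
         * after "tries" failed forks instead of hanging the terminal. */
        pid_t try_fork(int tries, unsigned delay)
        {
            pid_t pid;

            while ((pid = fork()) == -1) {
                if (--tries <= 0) {
                    perror("fork");   /* out of process slots; tell the user */
                    return -1;
                }
                sleep(delay);         /* hope some other job releases a slot */
            }
            return pid;               /* 0 in the child, its pid in the parent */
        }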

3.  To further complicate the situation, some applications don't wait for
    their children.  A process occupies a process slot until its parent
    (or if its parent dies, its foster parent, process 1) waits for it.
    If an application keeps spinning off background print jobs, and never
    waits for them to finish, eventually it will hit MAXUPRC or run the
    system out of processes.  These will show up on a ps as zombie processes 
    with parents other than process 1.
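
    As an illustration of the problem (not any particular application's
    code), the sketch below spins off "background jobs" and never waits for
    them; while the parent is alive, ps shows the dead children as zombies,
    each still holding a process slot.

        #include <unistd.h>

        int main(void)
        {
            int i;

            for (i = 0; i < 5; i++) {     /* "5" is arbitrary */
                if (fork() == 0)
                    _exit(0);             /* the "job" finishes immediately */
                /* parent never wait()s, so each dead child is a zombie */
            }
            sleep(60);        /* keep the parent alive so ps can show them */
            return 0;
        }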

    A compromise for this might be to allow one outstanding background
    job, and after spinning off the second one, wait for one of them
    to finish.  Also, doing a wait() with an alarm() set can pick up
    already-terminated processes without waiting for all of them.

    Fix:  applications should wait for their children.  (Also check
    the status and report problems!)
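
    Here is a sketch of the wait()-with-alarm() idea (my illustration, with
    the timeout left to the caller): install a do-nothing SIGALRM handler so
    the signal interrupts wait() instead of killing the process, then reap
    whatever children have already terminated, stopping when none are left
    or the alarm fires.  That matches the System III style signal() semantics
    of the era; with modern signal handling that restarts system calls, you
    would use sigaction() without SA_RESTART to get the same effect.

        #include <signal.h>
        #include <sys/types.h>
        #include <sys/wait.h>
        #include <unistd.h>

        static void onalarm(int sig) { (void)sig; /* just interrupts wait() */ }

        /* Reap already-terminated children; give up after "secs" seconds
         * instead of blocking until every child has exited. */
        int reap_children(unsigned secs)
        {
            int status, reaped = 0;

            signal(SIGALRM, onalarm); /* default action would kill the process */
            alarm(secs);

            while (wait(&status) > 0) /* each success picks up one zombie */
                reaped++;             /* (check "status", report failures) */

            /* wait() failed: ECHILD (no children) or EINTR (alarm fired) */
            alarm(0);                 /* cancel any pending alarm */
            return reaped;
        }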

4.  This one gets a little exotic, and may be specific to the system I
    am using.  It doesn't apply to systems that do paging instead of swapping.
    It also isn't related to running everything under one user id.
    There is a limit in the kernel to the maximum amount of memory a process
    can use at one time.  Suppose the kernel is re-configured to raise this
    limit above the amount of memory available.  (Limit > physical memory -
    kernel memory.  "Available memory" here means the maximum amount of
    memory a process can get without hanging the system, once administrative
    restrictions are raised.)  Now have an application program request 110%
    of available memory.  That request will fail.  Have an application
    request 100% of available memory plus one allocation unit.  That request
    doesn't fail (but it should).  The process gets swapped out and tries to
    swap back in; while doing so, the swapper swaps everything else out.
    You can't kill the huge process because it needs to swap in to die.
    Something else running may lock up behind this process, or it may
    run, but slowly, because it keeps getting swapped out.

    The fix for this is to not let processes get away with requesting that
    much memory.  The easiest way is to lower the "administrative limit"
    maxprocmem.  This may not be present in System V, or it may exist in
    another form.
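
    For illustration only (the request size below is a made-up placeholder,
    and actually running something like this on a swapping system is exactly
    what the paragraph above warns against), the danger is the case where
    the oversized request succeeds and the process then touches all of it:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define REQUEST (4UL*1024*1024)  /* hypothetical "just over available" */

        int main(void)
        {
            char *p = malloc(REQUEST);

            if (p == NULL) {
                perror("malloc");     /* the safe outcome: request refused */
                return 1;
            }
            /* The dangerous outcome: the request succeeded, and touching
             * every page makes the process too big to coexist with anything
             * else, so the swapper pushes everything else out. */
            memset(p, 0, REQUEST);
            printf("allocated and touched %lu bytes\n", (unsigned long)REQUEST);
            free(p);
            return 0;
        }
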
					Gordon L. Burditt
					...!texbell!sneaky!gordon


