SCO Xenix System Hang
Gordon Burditt
gordon at sneaky.TANDY.COM
Mon Dec 19 14:33:17 AEST 1988
The described "hang" (the system runs very slowly, but having
different users log in under different user ids fixes or reduces
the problem) is caused by the per-user-id process limit.
There are several features that contribute to this:
1. There is a limit to the number of processes a non-root user may
have running at one time, called MAXUPRC. The default is probably
something like 15. If you can re-link the kernel, you can probably
raise this limit. (I am working from an old version (non-*86) of
SCO Xenix System III, so some of this may have changed). Raising this
limit does not increase the size of any tables. If you need lots of
processes (and this problem will exist regardless of what uids they
run under), you may need to increase the number of entries for processes,
open files, and inodes. If you run out of open files or inodes, you
get cryptic console messages like "no file". If you run out of
system processes, you may get error messages (csh), or just lots of
retrying (sh).
Fix: don't run everything under the same uid, and/or raise MAXUPRC.
2. If the "sh" shell gets an error on a "fork" due to running over the
MAXUPRC limit, it retries. Forever, unless interrupted by a signal.
For a quick test of this, log in, then type sh<RETURN> repeatedly.
After about 15 or 20 times, you won't get another prompt. Use your
interrupt character to unlock the terminal. Then type lots of
control-D's to get rid of all those shells.
Now, imagine three users logged in under the same user id. Each
has 5 processes, and is trying to create a 6th with sh. None of them
will get any work done until one of them aborts the retries and
terminates that shell. Whether or not the interrupt character can
do this, and whether trying will destroy data, depends on the application.
The scenario this retrying is supposed to handle is waiting for another,
independent job started from the same terminal (say, a mailer doing
background delivery) to release its process slots, so that the retried
fork eventually succeeds without further action from the user.
Fix: Try to arrange jobs to not require deep nesting of processes.
3. To further complicate the situation, some applications don't wait for
their children. A process occupies a process slot until its parent
(or if its parent dies, its foster parent, process 1) waits for it.
If an application keeps spinning off background print jobs, and never
waits for them to finish, eventually it will hit MAXUPRC or run the
system out of processes. These will show up on a ps as zombie processes
with parents other than process 1.
A compromise for this might be to allow one outstanding background
job, and after spinning off the second one, wait for one of them
to finish. Also, doing a wait() with an alarm() set can pick up
already-terminated processes without waiting for all of them.
Fix: applications should wait for their children. (Also check
the status and report problems!)
4. This one gets a little exotic, and may be specific to the system I
am using. It doesn't apply to systems that do paging instead of swapping.
It also isn't related to running everything under one user id.
There is a limit to the maximum amount of memory a process can use at
one time in the kernel. Suppose that the kernel is re-configured to raise
this limit to above the amount of memory available. (Limit > physical
memory - kernel memory. "Available memory" means the maximum amount of
memory a process can get without hanging the system, after administrative
restrictions are raised.) Now, have an application program request 110% of
available memory. This request will fail. Have an application request
100% of available memory plus one allocation unit. This request doesn't
fail (but it should). The process gets swapped out and tries to swap
back in; while doing so, the swapper swaps everything else out.
You can't kill the huge process because it needs to swap in to die.
Something else running may lock up behind this process, or it may
run, but slowly because it keeps getting swapped out.
The fix for this is to not let processes get away with requesting that
much memory. The easiest way is to lower the "administrative limit"
maxprocmem. This may not be present in System V, or it may exist in
another form.
Gordon L. Burditt
...!texbell!sneaky!gordon
More information about the Comp.unix.xenix
mailing list