4D-200 series hangs frequently

Sun Jun 30 13:21:45 AEST 1991

For what it's worth, we have a 4D/240 and a 4D/340 that hang
frequently too.  (Frequently == every day at the worst of times, once
every four days at the best; we've been waiting for them to pass the
canonical "can it stay up and working for 30 days" test for over a
year now :-)  (It took our Sun4/280 a year and a half to reach this
blissful state, so this must be something about modern OSes.)  Both
systems run 3.3.1; neither has a graphics console.  Both have their
console serial lines wired to a Develcon Develswitch, so we can get at
them remotely when need be.  Both act as fileservers for diskless
Sun3s, have users login from our terminal server, and from X terminals
or workstations.

The 240 runs a non-standard Ciprico disk controller and driver, so
it's possible that the problems on it are our fault.  However, the 340
is standard SGI hardware and software, with just a few kernel constants
(like streams buffers) cranked up (and a couple of streams code fixes
that solved some of the more frequent hangs).

Hangs typically occur when the system has a dozen users or so -- the
340 usually has all four processors busy with crunch jobs in the
background, while the 240 usually has a couple of processors idle.
Both systems are frequently pushed to the limit, the 340 more often
than the 240 (though the 240 hangs more often).  Load must be some
part of the problem, because other less loaded 240s and 280s around
here have stayed up for months on end.

Both systems have a lot of NFS hard mounts, including cross-mounts.
We're well aware that an NFS server going down can hang them, but
there have been many hangs that cannot be explained by this.  (We also
mount NFS directories in /nfs/machine/filesystem to try to avoid some
of the problems.)

I've seen some correlation between hangs and a home directory file
system filling up.  Not too conclusive, though.  (Both machines have a
reasonable amount of swap -- 200MB or so; we're well aware that a
process filling up swap can degrade the system impressively as it does
its best to make the process dump core :-)

I know we ought to have reported this in more detail before, but we've
been embarrassingly sluggish about collecting enough facts to make this
sort of report useful to the kernel folk we contact at SGI, and calls
to the hotline about this sort of problem produce, um, less than
helpful answers once we confirm that we have lots of space in /tmp, we
age logs regularly, and have lots of swap space.

	Mark.