SCO Xenix System Hang

Robert R. Kessler kessler%cons.utah.edu at wasatch.UUCP
Sun Dec 11 08:43:11 AEST 1988


We are having problems with SCO 386 Xenix and are looking for some help.

Here is the scenario:

Our customer is running on an IBM PS/2 Model 80, 20 MHz, with a Hostess
multiport board.  They run about 6 concurrent users.  We have
installed the latest version of Xenix (2.2.? -- I don't recall which
exactly).

Our applications are all written in RM/COBOL supported by Austec.

Our customer arrives in the morning around 6 am and starts using a
terminal or two.  By 9 am they are up to full strength running all 6
terminals.  All users of our software log in with the
same user id, which starts our own user interface shell
(written in COBOL).  It emulates the user interface that we had on our
original system running on TI minis.  The user then selects the
program to run and our shell uses the COBOL CALL statement to call the
program.  I don't believe that it actually forks a process to do this,
though I might be wrong.  A user seldom logs all the way out and
instead just goes in and out of programs from our shell.  Some time
after 11:00 am, screen output on all the terminals starts to slow
down.  Instead of blasting the fields to the screen, the system paints
them a line at a time.  If someone can get to a terminal with the
login prompt,
they may be able to log in as shutdown and reboot the system.  All
user programs can usually save their data (the applications are all
doing data base like operations, using the key-indexed files provided
by COBOL -- we don't use any external data base facility).  However,
they cannot exit back out of the programs.  They just hang.

If they are successful at rebooting the system, everyone comes back up
and it works just fine.  If not, then the whole system hangs.  The
hang is interesting.  If you have a terminal sitting at the regular sh
prompt, you can type carriage returns and the prompt is echoed.  If
you run any command (ps, shutdown, etc.) it just goes away and
never responds.  You can still type on the terminal and characters are
echoed, but nothing happens.  You can also switch to a different
screen and bring up the system prompt.  However, if you try to type on
this screen, nothing is echoed.

It seems to be related to the amount of work that gets done.  If they
reboot and then go for another three hours, it will hang again.  Our
customer is currently rebooting at 11, 2 and 5, before going home and
running the nightly backup and upkeep programs.  It is extremely
inconvenient.  We have three other customers waiting for their systems,
but we don't want to send them until we can get this problem fixed.

We have contacted SCO a couple of times and Austec, but haven't had
any resolution of the problem (there seems to be some finger pointing
between the two).

Another data point -- I tried to simulate the problem and wrote a
program to CALL a couple of programs and exit, etc.  I eventually did
get code that would always cause the system to hang in exactly the
same way (but it doesn't need to do any calls).  I tracked down the
problem and it is some kind of record/file locking problem.  The
program that eventually causes the hang essentially opens and writes
to a shared file.  It hangs randomly, when one terminal opens or
closes the file and another writes to it.  We guarantee that they
don't write concurrently to the same records, but even so it
shouldn't get into a situation where it hangs the entire system.  The
resulting hang acts just like the hang at our customer.  However, this
hang can happen after one minute or after 3 hours.  It is entirely
timing dependent, not
load dependent.  I believe that my program uncovers another bug, and
really isn't what our user is seeing (I tried rewriting the program so
the file isn't shared and installed it at the customer -- it didn't
help since we still have lots of shared files that are used in the
system).  Plus, the fact that it is timing dependent makes me
believe that it is a different problem, though the result is the same.
(BTW -- the buggy program was run on a COMPAQ 386/20 DeskPRO running
2.3.1 Xenix).
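
For illustration only, the access pattern described above (several
terminals repeatedly opening and closing one shared file while each
writes its own records) can be sketched as follows.  This is not the
original RM/COBOL program.  It is a minimal Python sketch for a
present-day POSIX system, and the names worker, run_stress and
RECORD_SIZE, the record length, and the use of fcntl.lockf byte-range
locks are all assumptions made for the example.  On a healthy kernel
it should simply run to completion.

```python
import fcntl
import os

RECORD_SIZE = 32  # hypothetical fixed record length, in bytes


def worker(path, slot, iterations):
    """Repeatedly open the shared file, rewrite one record, and close it."""
    offset = slot * RECORD_SIZE
    for i in range(iterations):
        # Open and close on every pass, so opens and closes in one
        # process interleave with writes in the others.
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            # Lock only this worker's record; no two workers share a record.
            fcntl.lockf(fd, fcntl.LOCK_EX, RECORD_SIZE, offset, os.SEEK_SET)
            text = "slot %d pass %d" % (slot, i)
            record = text.ljust(RECORD_SIZE - 1) + "\n"
            os.pwrite(fd, record.encode("ascii"), offset)
            fcntl.lockf(fd, fcntl.LOCK_UN, RECORD_SIZE, offset, os.SEEK_SET)
        finally:
            os.close(fd)


def run_stress(path, workers=6, iterations=200):
    """Fork one process per simulated terminal; return their wait statuses."""
    pids = []
    for slot in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: do the work, then leave without returning to the caller.
            status = 1
            try:
                worker(path, slot, iterations)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    # A status of 0 means the child exited cleanly.
    return [os.waitpid(pid, 0)[1] for pid in pids]
```

Calling run_stress on a scratch file with workers=6 mirrors the
six-terminal setup.  Because each worker touches only its own record,
a run that never returns would suggest trouble in the locking layer
rather than contention between writers.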

Any help would be greatly appreciated.

Can I write some logging programs to write useful information to a
file that we could examine after a crash?  Is there some system
parameter that I could tweak to alleviate it?
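
One possible shape for such a logging program is sketched below, in
Python for a present-day POSIX system rather than for Xenix itself.
The names log_snapshot and run_logger, the log format, and the choice
of command to run are all hypothetical.  Each snapshot is forced to
disk before the next one is taken, so the entries written before a
hang should still be readable after the reboot.  Since commands such
as ps stop responding during the hang, the sketch records a timeout
instead of waiting forever.

```python
import os
import subprocess
import time


def log_snapshot(logfile, command, timeout=10):
    """Append one timestamped snapshot of the command's output to logfile."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        result = subprocess.run(command, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, timeout=timeout)
        body = result.stdout.decode("ascii", "replace")
    except subprocess.TimeoutExpired:
        # The command stalled; the timestamp of this entry marks when.
        body = "command did not finish within %d seconds\n" % timeout
    with open(logfile, "a") as f:
        f.write("=== %s %s\n%s" % (stamp, " ".join(command), body))
        f.flush()
        os.fsync(f.fileno())  # push the entry to disk before continuing


def run_logger(logfile, command, interval=60, count=None):
    """Take a snapshot every `interval` seconds; count=None runs forever."""
    taken = 0
    while count is None or taken < count:
        log_snapshot(logfile, command)
        taken += 1
        if count is None or taken < count:
            time.sleep(interval)
    return taken
```

For example, run_logger("/tmp/snap.log", ["ps", "-e"]) would append a
process listing once a minute; the path and the ps options here are
assumptions.  One limitation: if the command is stuck inside the
kernel, killing it after the timeout may itself block, so the last
complete entry in the log is the reliable one.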

Thanks.
Bob.


