restarting processes

Robert Scott zeke at shamash.cdc.com
Wed Sep 5 00:17:19 AEST 1990


In article <1990Sep3.235815.17361 at wrl.dec.com>, vixie at wrl.dec.com (Paul Vixie) writes:
> I'd like to do this also.  But if your process has pipes open to other
> processes, then those other processes would have to be restarted in the
> same state if your process was to be restarted "correctly".  If you had
> files open, those same files would have to be there when you restarted,
> with the same contents.  If you had a physical device file open, the
> results could be confusing (let's say someone else dismounts your tape
> and mounts one of their own -- can you get your tape back to the same
> "state" it was in when you restart your program?).  And of course, if
> you had any network connections open, then all of this stickiness extends
> to whatever processes you're talking to on (the) remote machine(s).
> 
> This kind of restartability wasn't on the UNIX designers' minds, and the
> system call interface has absolutely no architectural support for it.
> The thing you're trying to do is usually done at the application layer,
> as in "commit" operations in databases, and like that.
> 
> Stuff deleted...

On most Control Data machines running NOS or NOS/VE, and on the old Cyber 205
supercomputer, there is a facility for "checkpointing" the system.  When
the operator does this, the state of all running processes is saved, complete
with open file info and everything.  After the checkpoint, the system can be
brought down for maintenance or whatever, and then restored to the
checkpointed running state by reloading the system and going through a
"restart" process to reload and restore the executing jobs.

I believe that on the Cyber 205 we could also checkpoint individual jobs.  The
big difference between UNIX and VSOS (the 205 OS), though, is that each 205
job is almost always a single process unless it is a system task.

As Paul writes above, UNIX poses many possible problems for this kind of
operation.  Remember, UNIX was written basically as a small-computer OS for
interactive access, and wasn't originally intended to run weather
models or other large programs that might have to run on a supercomputer
for 24 hours before completing.  On the large mainframes, particularly in
the scientific computing arena, huge data reductions and repetitive
calculations are the norm, as is batch input/output.  As a matter of normal
operation on these giant pieces of iron, programs and entire OS states need
to be saved so that the machine can be serviced or a higher-priority program
run.

Checkpoint on UNIX would be a nice idea, though.
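
In the meantime, the application-layer approach Paul mentions can be roughed
out as below.  This is only a sketch under my own assumptions: the function
names, the JSON checkpoint file, and the "stop_at" knob that simulates the
machine going down are all made up for illustration.  It saves nothing but the
program's own data, so it does not touch the hard parts (pipes, tapes, network
connections).

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename is atomic on POSIX: a reader sees the old file or the new one.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def run(path, n_steps, every=100, stop_at=None):
    """Sum 0..n_steps-1, checkpointing every `every` completed steps.

    stop_at (hypothetical) simulates the system going down just before
    that step; anything done since the last checkpoint is lost.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if stop_at is not None and step == stop_at:
            return None
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Running the job a second time with the same checkpoint file picks up from the
last saved step instead of starting over, which is about as much "restart" as
you can get without help from the kernel.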


Zeke

~~~~~~~~~~~ From the Shrine of the "Last Gasp of ETA Systems" ~~~~~~~~~~~~~
Extra zesty disclaimer:  MINE! MINE! ALL MINE! <chortle snort froth drool>
Robert K. "Zeke" Scott        internet: zeke at eta.cdc.com
Control Data Corp, Supercomputer Support Group



More information about the Comp.unix.internals mailing list