Trouble killing processes

Tue May 17 20:59:34 AEST 1988

In article <1037 at unmvax.unm.edu> mike at turing.UNM.EDU.UUCP (Michael I. Bushnell) writes:
>In article <6832 at swan.ulowell.edu> arosen at hawk.ulowell.edu (MFHorn) writes:
>>Would a program that does the following get rid of the process?
>>
>>1: Gets the process' proc struct from the kernel.
>>2: Changes fields like the status, priority, cpu usage, wchan, exit status
>>   and maybe others so the kernel will have good reason to terminate the
>>   process.
>>3: Writes the new struct back out (open /dev/mem for write, lseek, write).
>
>Ack! no!
>
> [If you kill it, you may lose the resource for good (no process to release
> it).  You could try figuring out what the resource is and unlocking it
> yourself.]

[Good, I thought this had died...]

I've wanted such a program in two situations.  One is the case I
mentioned: a user permanently 'allocating' processors in a parallel
machine.  The other is a tape drive getting hung (which happens a little
too often), making backups impossible until you reboot.

The idea behind nuke(1) would be to talk the kernel into letting the
process exit: change fields in the proc (and maybe user) struct, such as
its priority (send it through the roof), CPU time used (also through the
roof), and status and exit status (hey, this process already died?!),
maybe point wchan to null, send it a SIGHUP (after making sure it's not
catching or ignoring SIGHUP), etc.
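The read-modify-write in the quoted steps (seek to the entry, read it, change fields, seek back, write it) can be sketched roughly as below.  This is only the mechanics, under loud assumptions: struct fake_proc, its field names, the FAKE_SZOMB state code and the nuke_entry() helper are all hypothetical, not any real kernel's struct proc; the target is an ordinary file standing in for /dev/mem; and finding the real offset of a proc-table entry is left out entirely.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a proc-table entry; NOT a real kernel layout. */
struct fake_proc {
    int32_t  p_pri;    /* scheduling priority (left untouched here) */
    int32_t  p_cpu;    /* recent CPU usage */
    int32_t  p_stat;   /* process state */
    uint64_t p_wchan;  /* address the process is sleeping on */
};

enum { FAKE_SZOMB = 5 };  /* hypothetical "already dead" state code */

/*
 * Hypothetical helper: read the entry at byte offset `off` in `path`,
 * rewrite some fields, and write it back in place.
 * Returns 0 on success, -1 on error.
 */
int nuke_entry(const char *path, off_t off)
{
    struct fake_proc p;
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* Step 1: seek to the entry and read it. */
    if (lseek(fd, off, SEEK_SET) != off ||
        read(fd, &p, sizeof p) != (ssize_t)sizeof p)
        goto fail;

    /* Step 2: change the fields the post talks about. */
    p.p_stat  = FAKE_SZOMB;  /* "hey, this process already died?!" */
    p.p_cpu   = INT32_MAX;   /* CPU usage "through the roof" */
    p.p_wchan = 0;           /* no longer waiting on anything */

    /* Step 3: seek back and write the modified entry over the old one. */
    if (lseek(fd, off, SEEK_SET) != off ||
        write(fd, &p, sizeof p) != (ssize_t)sizeof p)
        goto fail;

    return close(fd);

fail:
    close(fd);
    return -1;
}
```

Against a live kernel the same three steps race with the kernel's own updates to the entry, which is part of why this approach is so likely to take the system down with the process.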

Chris Torek and Guy Harris both said in the mail that a nuke program
could probably work, but it would also likely nuke the system.

>your program even smarter, and have it figure out just what things were
>locked and unlock them, but remember, they may be partially modified,
>and fixing them makes this an even more daunting prospect.

Any ideas on how to release the resource?  Or even on how to find it?
In the tape drive and processor examples, either method of attack
(killing the process outright, or preempting the resource) should be
safe, if it works at all.

The real fix would of course be in the kernel.  I would suggest setting
a timeout on each system call.  This way an lseek on a dead tape drive,
say, would fail after n seconds of CPU time.  Some context might need to
be saved before the system call starts so that things can be restored,
which could be expensive.  Comments?
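The proposal above is a kernel change, but a rough user-space analogue of a per-call timeout is to arm alarm() before a blocking call and let SIGALRM interrupt it.  A minimal sketch follows, assuming a POSIX system; timed_read() is a hypothetical helper, not an existing library call.  Two differences from the proposal: the limit is wall-clock seconds rather than CPU seconds, and it only helps when the call sleeps interruptibly.  A process stuck in an uninterruptible sleep, the usual reason kill -9 has no effect, will not be woken by SIGALRM either.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Set by the handler so we can tell our alarm apart from other signals. */
static volatile sig_atomic_t timed_out;

static void on_alarm(int sig)
{
    (void)sig;
    timed_out = 1;
}

/*
 * Hypothetical helper: read() that gives up after `secs` wall-clock seconds.
 * Returns bytes read, or -1 with errno == ETIMEDOUT if the alarm fired.
 * (If the alarm fires before read() actually starts blocking, read() can
 * still block; acceptable for a sketch with secs >= 1.)
 */
ssize_t timed_read(int fd, void *buf, size_t len, unsigned secs)
{
    struct sigaction sa, old;
    ssize_t n;
    int saved;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;  /* no SA_RESTART, so read() fails with EINTR */
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -1;

    timed_out = 0;
    alarm(secs);
    n = read(fd, buf, len);
    saved = errno;
    alarm(0);                        /* cancel the alarm if read() finished */
    sigaction(SIGALRM, &old, NULL);  /* restore the previous handler */

    if (n < 0 && saved == EINTR && timed_out)
        saved = ETIMEDOUT;
    errno = saved;
    return n;
}
```

For example, reading from a pipe whose write end is open but idle returns -1 with errno set to ETIMEDOUT after about a second when called with secs = 1, while a pipe that already holds data returns immediately.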

Andy Rosen           | arosen at hawk.ulowell.edu | "I got this guitar and I
ULowell, Box #3031   | ulowell!arosen          |  learned how to make it
Lowell, Ma 01854     |                         |  talk" -Thunder Road
                   RD in '88 - The way it should be