killing a process gone bad.

Neil Rickert rickert at mp.cs.niu.edu
Fri Nov 2 04:52:43 AEST 1990


In article <1119 at massey.ac.nz> GEustace at massey.ac.nz (Glen Eustace) writes:
>We recently had the exact situation described in the previous
>posting.  There was a little more code involved but the net effect
>was the same.  All attempts to clear out the system failed as there
>was no spare CPU available to allow remedial action to be taken.  The
>problem was cured by a reboot.
>
>Following our problem, the perpetrator posted to comp.unix.questions
>to find out what we could have done.  We received various replies
>including the 'kill -9 -1' variety.
>
  We have 10 processors.  Simple killing of replicating processes never
works, because more are created as fast as old ones are killed off.  I
regularly see students who inadvertently create the problem, and finish
up running out of processes (the local per-user limit is 50).

  I have NEVER had to reboot to resolve this problem.  My experience is
with a BSD system, so may not apply to SysV.

  Here are three simple approaches to try:

  (1)	The simple-minded approach.
	Look for a file which the programs depend on.  Try removing or
	renaming that file.  In particular, if the replicating process
	seems to be a shell script, look for a shell script in the user's
	directory named 'test'.

  (2)	The slow and tedious method.
	This is a method I sometimes ask the student and/or his instructor
	to use.  It is somewhat slow, as it requires killing all the processes
	individually.  It usually works.

	Step 1.  Find a list of the bad processes.  If the student is doing
	this himself, he can ask a friend on a different account to do a
	'ps uax|grep user' for this purpose.  Failing that, he should be
	able to log in, and then use 'exec ps ug'.  This will give the list
	of processes, but log him out again.

	Step 2.  Armed with a list of process IDs, start killing them with
	the STOP signal.
		exec /bin/kill -STOP pid pid pid ...
	The idea is to prevent further replication, but keep the processes
	in place so that you are always at the limit.  This step and Step 1
	may have to be repeated several times to stop them all.

	Step 3.  Start killing the STOPPED processes.  To do this you
	will need the output of 'ps l'.  You must not kill a child before
	killing the parent.  Killing the child may cause the parent to
	wake up, and go back to its errant ways of replicating itself.
	Most of the time when you see this some of the processes have
	process 1 as the parent ID.  The procedure is to kill all of the
	errant processes whose PPID is 1.  Keep repeating this step till
	they are all gone.  Usually this becomes easier as you proceed,
	for you stop getting the 'out of processes' message after killing
	a few, and no longer need to 'exec /bin/kill' and log in again after
	every try.
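
	The selection in Steps 1 and 3 can be sketched as two small filters.
	This is only an illustration: the function names are hypothetical,
	and the input is assumed to be one "PID PPID USER" triple per line,
	such as 'ps -e -o pid= -o ppid= -o user=' produces on a POSIX ps
	(the BSD 'ps l' columns are laid out differently).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not from the original post.
# Assumed input on stdin: one "PID PPID USER" triple per line.

# pids_of_user USER: print every PID owned by USER.
# These are the candidates for the STOP signal in Step 2.
pids_of_user() {
    awk -v u="$1" '$3 == u { print $1 }'
}

# orphans_of_user USER: print the PIDs owned by USER whose parent is
# process 1.  These are the ones to kill first in Step 3.
orphans_of_user() {
    awk -v u="$1" '$3 == u && $2 == 1 { print $1 }'
}
```

	Someone on another account could run, for example,
	'ps -e -o pid= -o ppid= -o user= | orphans_of_user student' and pass
	the resulting list to the student, who then uses it with
	'exec /bin/kill' as described above.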

  (3)	The brute force method.
	I posted a script to do this recently.  It was posted as article
	<1990Oct26.140851.11707 at mp.cs.niu.edu>.  Read that article for
	full information.  It requires that you be root to execute it,
	and it requires that the perpetrator's login shell be 'csh'
	(because 'kill' is then builtin and doesn't require a new process).

	The basic idea is 'blocking'.  You keep the number of processes at
	the limit, so as to prevent further replication.  The script does
	the following:
		for each errant process
			create a new process (/bin/csh) for the user.
			kill the errant process
			the new process exec's to 'sleep 10 minutes' so
			 as to be relatively harmless.
	If the processes are dying as well as replicating, my script may
	need to be rerun a few times.  But, regardless, it soon creates
	enough sleeps under the userid that further replication of all
	errant processes is impossible, so they either all die out
	naturally, or sit around long enough to be killed.
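
	The loop above might look roughly like the following.  This is not
	the script from the article cited above; it is a dry-run sketch that
	only prints the commands root might issue.  The function name, the
	use of 'su', and the 600-second sleep are assumptions, and whether a
	new process can be started for a user already at the limit depends
	on the system.

```shell
#!/bin/sh
# Dry-run sketch of the blocking idea.  Nothing is executed: the function
# only prints command lines.  The function name, the use of su, and the
# 600-second sleep are assumptions made for illustration.

# blocking_plan USER PID...: for each errant PID, print one command that
# starts a shell as USER, kills that PID with the shell's builtin kill,
# and then execs into a sleep so that the process slot stays occupied.
blocking_plan() {
    user=$1
    shift
    for pid in "$@"; do
        printf "su %s -c 'kill -KILL %s; exec sleep 600' &\n" "$user" "$pid"
    done
}
```

	For example, 'blocking_plan student 4021 4022' prints two 'su' lines,
	one per process ID, which can be inspected before anything is run.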

  I have thought of rewriting the script as a C program.  It would be SUID,
so that anyone could use it.  Basically it would allow a user to type
'exec superkill' to kill all of his processes.  I have never bothered to do
this because the problem does not seem to crop up often enough to go to
the trouble.

-- 
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
  Neil W. Rickert, Computer Science               <rickert at cs.niu.edu>
  Northern Illinois Univ.
  DeKalb, IL 60115.                                  +1-815-753-6940


