unbatcher

mogul at su-gregorio.arpa
Tue Apr 22 06:21:16 AEST 1986


From: Jeffrey Mogul <mogul at su-gregorio.arpa>

I think I know why the unbatcher hangs.  [Note: one consequence of
this is that after a few hours, there are so many unbatchers running
that the "news" user is not allowed to run any more processes.  The
way "rcp" works is to cause a shell to be exec'ed at the destination
end; the shell then forks a copy of "rcp", and apparently hangs at
that point because the system won't allow "news" to fork any more
processes.  I suspect the shell either doesn't check the return from
fork(), or, more likely, busy-waits until fork() succeeds.  One note:
if you say in /etc/passwd that news uses csh, not sh (the default),
then I think your connection doesn't hang.  But I'm not positive.]
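
To make the suspicion concrete, here is roughly the logic I imagine
in the shell's fork path (a sketch of my guess, not the actual sh
source):

    #include <errno.h>
    #include <unistd.h>

    /*
     * My guess at the shell's fork logic.  If fork() keeps failing
     * with EAGAIN because "news" is at its process limit, this loop
     * never exits -- and if the sleep() is missing, it busy-waits
     * as well.
     */
    int
    fork_until_it_works(void)
    {
        int pid;

        while ((pid = fork()) == -1) {
            if (errno != EAGAIN)
                return (-1);    /* hard error; give up */
            sleep(1);           /* omit this and it's a pure busy-wait */
        }
        return (pid);
    }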

Anyway, back to unbatch hanging: what happens is that unbatch causes
/usr/bin/rnews to run.  That script renices itself to +10; I think
/usr/lib/rnews does the same thing, because when I took the renice
line out of the script, rnews still ran at nice 10.
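
If /usr/lib/rnews really does demote itself, the mechanism is
presumably nothing more than a nice() call at startup -- something
like this (my assumption; I haven't read that source):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* what I suspect /usr/lib/rnews does first: demote itself */
        errno = 0;
        if (nice(10) == -1 && errno != 0)
            perror("nice");
        /* ... the rest of rnews ... */
        return (0);
    }

Note that nice() is relative, so stacking a renice in the script on
top of one in the binary would push the process past nice 10.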

On navajo, at least, we have users who run compute-bound jobs
that last for days (last week, one turkey ran the same program 5
times in parallel).  They start at nice 0 and the system demotes
them to nice 4 after a while, but the nice 10 rnews never gets
anywhere.  If you renice rnews to 0, or renice the compute-bound
jobs to 11, then rnews runs fine and things progress.  Alas, since
the rnews program is exec'ed anew for every message, the former
solution (renicing rnews) is impractical: you would have to do it
once per message.

I think rnews shouldn't renice itself.  If you are worried about
overload, have unbatch exit if the load is above some threshold.
The current situation leads to extreme resource stress.
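
Something along these lines would do -- a sketch only, and it
assumes a getloadavg()-style call, which not every kernel provides
(otherwise you would have to dig the load out of /dev/kmem):

    #include <stdlib.h>

    #define MAXLOAD 4.0     /* threshold; tune to taste */

    /*
     * The check I would put at the top of unbatch: bail out when
     * the 1-minute load average is above a threshold, instead of
     * renicing and wedging.
     */
    int
    main(void)
    {
        double load[1];

        if (getloadavg(load, 1) < 1)
            return (1);     /* can't tell; play it safe and quit */
        if (load[0] > MAXLOAD)
            return (1);     /* too busy; try again later */
        /* ... proceed to unbatch ... */
        return (0);
    }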

Also, I think the news distribution setup needs better flow control.
Glacier creates a copy of each article that is batched but not
yet delivered to all receiving sites.  Since some sites (e.g.,
ISL) are often down for days, this means that Glacier can potentially
have double copies of several days of new news.  If, in addition,
Glacier has recently been reconnected to DECWRL after a few days of
that link being down, the resulting pulse of new news can soak up
>10Mb of disk space.  The batching script on Glacier checks for
space before creating the batch, but if things constipate, that
space stays tied up, and the 1500K yellow zone can be swamped by a
day or two of incoming news.
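
A sturdier check would insist on real headroom before batching at
all.  Here is a sketch, using the POSIX statvfs() call purely as a
stand-in for whatever Glacier's script actually does:

    #include <sys/statvfs.h>

    #define HEADROOM (10.0 * 1024 * 1024)   /* ~10Mb, not a 1500K yellow zone */

    /*
     * Refuse to batch unless the spool filesystem has comfortable
     * headroom.  f_bavail * f_frsize is the space available to
     * non-superuser processes.
     */
    int
    enough_space(const char *path)
    {
        struct statvfs fs;

        if (statvfs(path, &fs) == -1)
            return (0);     /* can't tell; don't batch */
        return ((double)fs.f_bavail * (double)fs.f_frsize > HEADROOM);
    }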

I think we are soon going to be spending more time processing news
than it deserves.  I hope Greg can get the bugs out of his NFS
kernel soon; then we should try having a few uVaxen (one in CIS,
one in MJH, one at SUMEX?) maintain the news, and everyone else
just remote-mount /usr/spool/news so that the timesharing hosts
don't waste their time or disk space on this.
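
The client side of that is a single remote mount per host --
something like this, with "newsvax" standing in for whichever uVax
holds the spool:

    mount -t nfs newsvax:/usr/spool/news /usr/spool/news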

-Jeff


