4.2bsd kernel auto-nicing, scheduling

Michael Wagner wagner at utcs.uucp
Fri Mar 7 02:33:56 AEST 1986


I've just been catching up on unix-wizards, which I haven't looked at in
a loooong time, and I found this discussion of 4.2 scheduling.  
Interestingly, the Maryland stuff sounds *very* close to the 
(rather heuristic) scheduler in CP (the 'monitor', if you will) of VM.
It (basically) has:
1) a starting priority for every user
2) no explicit upper/lower limits per user, but certain other constructs
   generate something like a probability cloud around that starting 
   priority
3) a number of queues that users reside on.  Aside from the ones that 
   designate resource unavailability (notably storage overcommitment),
   there are basically 3 queues for 'runnable' users, called (with typical
   lack of imagination) Q1, Q2 and Q3.  They correspond to very interactive,
   somewhat interactive, and batch (by batch I mean it runs to completion
   once started, not that you send it off elsewhere).  In the HPO
   (High Performance Option) variant of the product, the issue is clouded
   (but vastly improved, I'm told...I'll shortly be able to report on this)
   by the splitting of each of these into waiting-for-CPU-dispatch and
   waiting-for-other-quick-services (page fault, etc) queues.  They are
   informally called the real-run-list and the run-list, respectively.
   I can't recall the formal names right now.  Interrupts move users
   between the run-list and the real-run-list, and only the real-run-list
   need be scanned on most redispatches.  
4) A transaction can be loosely defined as the time between pressing the
   enter key and getting the last of the results.  (For those who don't
   know, S/370 moves terminal interaction (irrevocably) into front-end
   hardware, so there is no raw-mode equivalent.  The entire input is
   presented to the mainframe in one interrupt when it is completed.
   Completion is signalled by a special key (enter).  Some (limited)
   local editing is possible before hitting enter.)  When a transaction
   is started, it is given a 'deadline' based on the basic priority of
   the user, some acknowledgement of the load average (called the expansion
   factor in CP), and an initial (recent-history-based, I think) 
   characterization of the transaction.  After that, the run lists are
   sorted and serviced in 'deadline' order.  This effectively prevents
   indefinite postponement (though, as pointed out in earlier postings,
   any strategy against indefinite postponement is rendered ineffective
   when faced with inadequate hardware).
5) While a transaction is 'runnable', it is thrown into the pool and gets
   time-slices in rough proportion to its deadline priority.  The short-
   term scheduler strategy is a modified round-robin (I think someone once
   called it a round-turkey :-) ).  If a transaction sucks up enough
   resource, the scheduler decides it was wrong to characterize this
   transaction as interactive, and it moves it to Q2 (which also has the
   effect of recalculating its deadline, because it has now shown itself
   to be less interactive).  A similar shift can occur from Q2 to Q3.
   Living in 'lower' queues has the effect of getting you larger slices
   of resource, but less frequently.  (A rough sketch of this deadline
   and demotion bookkeeping follows the list.)
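
For the curious, here is a very rough C sketch of how I understand the
deadline and demotion bookkeeping to work.  This is not CP source; all
names, constants and formulas here (set_deadline, the slice/stretch
tables, the demotion thresholds, the expansion-factor stand-in) are
invented for illustration, and the real thing is considerably more
elaborate.

/*
 * Hypothetical sketch of deadline scheduling with queue demotion.
 * All numbers and names are made up; only the shape of the idea
 * is meant to match the description above.
 */
#include <stdio.h>

#define Q1 0   /* very interactive              */
#define Q2 1   /* somewhat interactive          */
#define Q3 2   /* batch: runs to completion     */

struct user {
    char *name;
    int   base_priority;  /* starting priority, per user              */
    int   queue;          /* Q1, Q2 or Q3                             */
    long  deadline;       /* run lists are sorted/serviced in this order */
    long  cpu_used;       /* resource consumed in current transaction */
};

/* lower queues get bigger slices, but deadlines pushed further out */
static long slice[3]   = {  50, 200, 800 };   /* made-up units */
static long stretch[3] = {   1,   4,  16 };

/* crude stand-in for CP's "expansion factor" (load-average analogue) */
static long expansion_factor = 2;

/* assign a deadline when a transaction starts or is recharacterized */
void set_deadline(struct user *u, long now)
{
    u->deadline = now
                + (long)u->base_priority * expansion_factor
                * stretch[u->queue];
}

/* after a slice, decide whether the transaction has shown itself to be
 * less interactive than we guessed, and demote it if so */
void account_slice(struct user *u, long now)
{
    u->cpu_used += slice[u->queue];

    if (u->queue == Q1 && u->cpu_used > 4 * slice[Q1]) {
        u->queue = Q2;               /* wrong guess: not so interactive */
        set_deadline(u, now);
    } else if (u->queue == Q2 && u->cpu_used > 4 * slice[Q2]) {
        u->queue = Q3;               /* shift again, Q2 -> Q3 */
        set_deadline(u, now);
    }
}

int main(void)
{
    long now = 0;
    struct user ed  = { "cms-editor", 10, Q1, 0, 0 };
    struct user uts = { "uts-guest",  10, Q1, 0, 0 };

    set_deadline(&ed, now);
    set_deadline(&uts, now);

    /* run the guest flat out; watch it drift down to Q3 while the
     * idle editor keeps its tight Q1 deadline */
    for (now = 0; now < 10; now++) {
        account_slice(&uts, now);
        printf("t=%2ld  %s: Q%d deadline=%ld   %s: Q%d deadline=%ld\n",
               now, uts.name, uts.queue + 1, uts.deadline,
               ed.name, ed.queue + 1, ed.deadline);
    }
    return 0;
}

If you run it, the CPU-bound "guest" gets demoted from Q1 down to Q3 and
has its deadlines stretched out, while the interactive user keeps its
short Q1 deadline; that is (roughly) why editing response holds up even
when the machine runs flat out.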
There's more, but you get the point (and this is getting long).  One of
the things I found amazing (coming from a background with more 
sophisticated schedulers) was how well this all works, in the face of
tremendous loads.  We used to run 3 large UTS systems (a 370 UNIX-like
system) and a smattering of little CMS users (mostly development and
maintenance) on our system.  Even when the machine ran flat out, editing
response for CMS users was unaffected by the load.
(There were other problems, but they basically weren't CP's fault, and
are really another story).
As another example, we recently disconnected one channel path to DASD
(in order to provide a test path for a new system being built concurrently
with production).  That cuts our access to disk in half.
No one seems to notice in their editing response.  I can see it in the
performance reports, but it's minimal.  I can also now flog the system
with artificially constructed workloads and actually affect other users
(something I basically couldn't do when both channels were connected).
I think we're seeing the effect of inadequate hardware there, though, and
not bad scheduling decisions.

One place where this all falls down is that the scheduler basically makes
only CPU dispatching decisions.  It has no influence on I/O queue decisions
or on paging queue decisions.  This all works if you have overdesigned your
I/O configuration relative to the rest of the machine, but would fail 
badly otherwise.  This is relatively easy to do with 370 arch. I/O gear,
but (seemingly) much harder on VAXen.  I'm curious how this is compensated
for in 4.2's scheduling.

Well, enough for now.  I don't know how interested people on this list are
in hearing about work being done on other systems (especially blue ones :-)).


Michael Wagner (wagner at utcs)


