UNICOS 5.0 Memory Scheduler Problems
Gary Smith
gary at ut-emx.UUCP
Sat Oct 7 04:26:08 AEST 1989
Following is the text from a message I posted to the unicos-l mail
group approximately two weeks ago. Those of you who read both the
unicos-l information and this information are probably quite tired
of seeing it. Nonetheless, as there could be interested parties
who read this news and not unicos-l, I wanted them to have an op-
portunity to read it. CRAY Research is continuing to work on the
problems discussed in the paper and indications are that a new
memory scheduler will be included in UNICOS 6.0.
---------------------------------------------------------------------
Following is the text from a paper I presented at the CRAY User Group
Conference in Trondheim last week. The text is quite long but I felt
it important that readers of this mailing group be made aware of our
experiences with UNICOS 5.0 process scheduling. Lothar Wollschlager
of KFA (BITNET: ZDV026 at DJUKFA11) has had similar experiences on their
Y-MP8/832. The most important section is 2.2. CRAY Research has
already addressed many of these problems, but I believe a redesign of
UNICOS memory scheduling is nevertheless required.
UNICOS 5.0 Scheduling Problems on a CRAY X-MP
EA/14se
Gary Smith
Internet: g.smith at chpc.utexas.edu
BITNET: G.SMITH at UTCHPC
DECnet: UTCHPC::G.SMITH
The University of Texas System
Center for High Performance Computing
Balcones Research Center
10100 Burnet Road
Austin, Texas 78758-4497
September 1, 1989
ABSTRACT
Process scheduling algorithms that achieve ade-
quate batch job throughput while providing acceptable
response time for interactive jobs require the careful
integration of swapping-store I/O scheduling, central
memory scheduling, and CPU scheduling. Direct, effi-
cient couplings among these scheduling algorithms are
very critical for modest-resource supercomputers such
as the CRAY X-MP EA/14se. They must maximize user CPU
and I/O utilization by maximizing the degree of mul-
tiprogramming, within any real-time and/or political
constraints.
As released, the UNICOS 5.0 process scheduling
subsystem has proved ineffective at maximizing user CPU
and I/O utilization for typical job distributions on
our CRAY X-MP EA/14se. This is not because our modest
UNICOS hardware platform is intrinsically inadequate to
achieve acceptable performance for our predominantly
batch job mix: the COS 1.16 job scheduler consistently
achieves user CPU utilization which exceeds 85% and
simultaneously provides high user I/O utilization with
similar job mixes on our CRAY X-MP/24. Nor is it due
to poorly-chosen schedv parameters. Rather, it is due to
design oversights in the UNICOS 5.0 process scheduling
software, software that fails to maximize the degree of
multiprogramming for multiple reasons, including an
ineffective strategy for avoiding central memory frag-
mentation.
1. Background
1.1 CHPC Established
The University of Texas System Center for High Performance
Computing (CHPC) was established by the University of Texas Sys-
tem Board of Regents in 1986 to serve the research and instruc-
tional supercomputing needs of the academic and health science
component institutions of the UT System. The initial equipment
configuration included a CRAY X-MP/24 supercomputer with 32-
megaword SSD, 9.6 gigabytes of DD49 mass storage and XIOP-based
3420 and 3480 magnetic tape access, all controlled by the COS
1.14 operating system. Support computers included a VAX 8600
front end running VMS and an IBM 4381 file server running MVS.
The CHPC initiated production operations on May 15, 1986. Access
to the services of the CHPC is provided via the Texas Higher Edu-
cation Network (THEnet), a state-wide, heterogeneous network
administered by the UT System Office of Telecommunications Ser-
vices (OTS). See Figure 1.
1.2 CRAY X-MP/24 Saturated
By February of 1987, the CPU resources of the CRAY X-MP/24
had become fully saturated, with typical job backlogs of up to
one week. Clearly, to address the turnaround time problem
without strict allocation procedures, additional CPU resources
were required. Even though larger central memory was being
requested by several users, four megawords had proved adequate to
achieve a degree of multiprogramming that allowed user CPU utili-
zations of above 85% for typical job mixes.
1.3 UNICOS Migration Initiated
Over the next year, UNIX continued its emergence as an
evolving de facto standard operating system for computers in all
performance classes, and our decision to migrate to a homogeneous
UNIX environment was justified on three primary bases. UNIX had
been chosen as the operating system of the future by CRAY
Research and other high-performance computer vendors. Equally
important was CHPC's goal to establish a partnership between com-
puting resources locally available at the UT component campuses,
primarily UNIX-based scientific workstations and near-
supercomputers, and those of the CHPC facility. Achieving this
goal required the uniform network service functionality
approached only in the UNIX environment. Finally, UNIX provided
the best available approximation to our goal of vendor indepen-
dence.
Presuming that the UT System Board of Regents would approve
of a major expansion of the production vector supercomputing
resources provided by the CHPC (approval was subsequently granted
in June, 1989), migration of the UT System CHPC facility to UNIX
was approved in April of 1988. As the COS 1.16-based CRAY X-
MP/24 was production-saturated, the vehicle chosen for migration
was a CRAY X-MP EA/14se with 4 megawords of BMR, 8.4 gigabytes of
DD39 mass storage and XIOP-based 3420 and 3480 magnetic tape
access, all controlled by the UNICOS 4.0 operating system. Sup-
port systems included a CONVEX C120 access server running CONVEX
UNIX 6.2 and a Sun-3/280S gateway server running SunOS 3.5. See
Figures 2 and 3. New hardware was installed during October, 1988
and acceptance testing was completed by January 1, 1989. The
plan was that, after 9 to 12 months, the majority of the COS
workload could be moved to the UNICOS context. At that time,
either "COS stragglers" would be accommodated by reversing the
roles of the COS 1.16 X-MP/24 and UNICOS 5.0 X-MP EA/14se, or
UNICOS 5.0 would be run on both X-MP's.
2. Problems
2.1 UNICOS 4.0 Installed
After a brief development period, in which CHPC-specific
accounting and NQS-temporary file ($TMPDIR) modifications were
integrated into UNICOS 4.0, the X-MP EA/14se was made available
for production. As expected, the early job mix proved to be pri-
marily batch, yet we were routinely measuring low CPU utiliza-
tion. Careful investigation of the UNICOS 4.0 source code
revealed that in an attempt to integrate batch scheduling with
more-or-less standard UNIX priority-based process scheduling, the
developers of UNICOS 4.0 had not provided us with the scheduling
flexibility we required. We believe most sites purchase CRAY
supercomputers primarily for high-performance sustainable CPU
throughput. Yet due to the simple negative-feedback-on-CPU-usage
process priority function in UNICOS 4.0, the highly I/O-bound
memory hogs (memhogs) received a much larger share of central
memory occupancy than CPU-bound memhogs. The I/O-bound memhogs
even competed effectively with small CPU-bound processes. CPU-
bound process priorities decayed much faster than those of I/O-
bound memhogs, resulting in less than 50% CPU utilization at
times when many CPU-bound processes were runable. After some
trivial coding changes to make CPU-bound memhogs available for
swap-in sooner, they achieved a larger share of central memory
occupancy. This allowed us to achieve nearly 75% CPU utilization
with the same job mix.
2.2 UNICOS 5.0 Installed
Having heard from CRAY Research employees at many levels
that UNICOS 5.0 was to be the most "feature-rich" and reliable
release to date, due in part to extremely thorough testing under
production loads, we ordered installation materials and documen-
tation on May 15, 1989 and installed UNICOS 5.0.10 for production
on July 11. By September 1, just over seven weeks later, we had
two CRITICAL, ten URGENT and two MAJOR SPR's outstanding against
UNICOS 5.0, including problems with the kernel (primarily the
process scheduler), TCP/IP modules, fsck, mkfs, NQS, msgdaemon
and crayperf. Of these SPR's, two CRITICAL's and two URGENT's
against the process scheduling subsystem represent significant
problems for our site.
After some definitions and a very brief overview of UNICOS
5.0 scheduling philosophy, the remainder of this section will
detail several of the process scheduling problems we have experi-
enced. See Figure 4.
Central Memory Scheduling Terminology
cpuhog a process that has exceeded the CPU time limit set
for processes with modest CPU requirements
memhog a process that has exceeded the central memory limit
set for processes with modest central memory require-
ments
hog cpuhog or memhog
hogmem total central memory that can be captured at any time
by hogs
roadhog an in-memory process is defined as a roadhog when the
sum of its size and that of the highest-priority
swap-in process exceeds available user memory
maxbad long-term sleeping processes
kindabad short-term sleeping processes
Central Memory Scheduling Parameter Definitions
Scheduling parameter weight factors: x_in defines the weight factor
to be multiplied by an attribute associated with x for processes
residing in central memory when computing their swap-out prior-
ity. Similarly, x_out defines a weight factor to be multiplied by
an attribute associated with x for processes residing on the swap
device (SWAPDEV) when computing their swap-in priority.
+ Throughput scheduling:
mfactor: process size
tfactor: process residence time
+ Political scheduling:
pfactor: process standard UNIX priority
shfactor: process fair share priority
nfactor: process nice value
Scheduling parameter damping/clipping factors:
min_inpri runable processes in central memory are excluded from
swap-out until their swap-out priority is lower than
min_inpri, unless max_inage is exceeded
min_outpri runable processes on SWAPDEV are excluded from swap-
in until their swap-in priority is higher than
min_outpri, unless max_outage is exceeded (and the
process is a hog)
max_inage guaranteed central memory residency time for runable
processes; processes that have occupied central
memory for max_inage seconds can be swapped out
regardless of min_inpri
max_outage guaranteed SWAPDEV residency time for hog processes;
hog processes that have been swapped-out for
max_outage seconds can be swapped in regardless of
min_outpri
thrash-blks no more than thrash-blks blocks may be swapped per
thrash-inter seconds if thrash-inter is nonzero
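To make the interaction of these damping and clipping factors
concrete, the following C sketch expresses the eligibility tests
just defined. The structure and names are our own illustration,
not UNICOS source code.

     /* Illustrative sketch only: structure and field names are
      * our own, not taken from UNICOS source. */
     struct sk_proc {
         long swappri;    /* current swap priority               */
         long residency;  /* seconds in current residence        */
                          /* (central memory or SWAPDEV)         */
         int  is_hog;     /* nonzero if the process is a hog     */
     };

     /* A runable in-memory process becomes a swap-out candidate
      * only after its priority drops below min_inpri or its
      * guaranteed residency max_inage expires. */
     int eligible_for_swapout(const struct sk_proc *p,
                              long min_inpri, long max_inage)
     {
         return p->swappri < min_inpri || p->residency >= max_inage;
     }

     /* A runable process on SWAPDEV becomes a swap-in candidate
      * only after its priority rises above min_outpri or, for a
      * hog, its guaranteed SWAPDEV residency max_outage expires. */
     int eligible_for_swapin(const struct sk_proc *p,
                             long min_outpri, long max_outage)
     {
         return p->swappri > min_outpri ||
                (p->is_hog && p->residency >= max_outage);
     }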
Central Memory Scheduling Algorithm (sched)
sched computes swap-in priorities to order swapped-out run-
able processes for swap in. Similarly, sched computes swap-out
priorities to order swapped-in processes for swap out, to make
room to swap processes in. Swap priorities are computed as a
simple sum of products, using weight factors set via schedv.
Higher numeric values indicate better priority. Thus, the pro-
cess with the numerically largest priority on the swap-in queue
will swap in first. The in-memory process with the numerically
smallest priority will swap out first. Once the "best" candi-
dates for swap-in and swap-out are found, their priorities are
compared. If the swap-in candidate has a priority less than that
of the swap-out candidate, no swap occurs. sched then suspends
itself for one second after which it reevaluates the situation.
Of course, special cases occur for in-memory processes in states
such as roadhog, sleeping or locked-in: any roadhog process is
chosen prior to the "worst" maxbad process; any maxbad process is
chosen prior to the "worst" kindabad process; any kindabad process
is chosen prior to the "worst" runable process, within the con-
straints of any damping or clipping factors.
Swap-in priority is computed as follows:
P = priority*pfactor_out + sharepri*shfactor_out +
nice*nfactor_out + size*mfactor_out + time*tfactor_out
Swap-out priority is computed similarly:
P = priority*pfactor_in + sharepri*shfactor_in + nice*nfactor_in
+ size*mfactor_in + time*tfactor_in
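Both formulas are thus the same sum of products evaluated with
different weight sets; the following C sketch restates the
computation, again with purely illustrative names.

     /* Minimal sketch of the sum-of-products swap priority; the
      * structure and names are illustrative, not UNICOS source. */
     struct sk_weights {
         long pfactor, shfactor, nfactor, mfactor, tfactor;
     };

     long swap_priority(long priority, long sharepri, long nice,
                        long size, long time,
                        const struct sk_weights *w)
     {
         return priority * w->pfactor + sharepri * w->shfactor +
                nice     * w->nfactor + size     * w->mfactor +
                time     * w->tfactor;
     }

     /* sched evaluates this with the x_out weights for swapped-out
      * runable processes and with the x_in weights for in-memory
      * processes; if the best swap-in candidate's priority is less
      * than the best swap-out candidate's, no swap occurs and
      * sched sleeps for one second before reevaluating. */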
2.2.1 Central Memory Scheduler Problems
During beta testing of UNICOS 5.0 at NCAR, several problems
were found in sched. One important problem was so-called
"roadhog detection", wherein sched did not detect processes that
were required to swap out in order to fit the highest-priority
process on the swap-in queue into central memory. sched was
modified such that when roadhogs were occupying central memory,
they would be considered exclusively for swap out. Unfor-
tunately, a precondition for roadhog detection was that the pro-
cess be a hog. As released, UNICOS 5.0 schedv parameters declare
no hogmem, and thus roadhog deadlock detection failed until the
default schedv parameters were modified. This resulted in
several uncontrolled shutdowns (because we could not log in on
the system console to run our shutdown script) and our first
UNICOS 5.0 CRITICAL SPR.
As released, the UNICOS 5.0 default schedv parameters result
in other problems as well, the most significant of which is
thrashing. Following is a quote from a posting to the UNICOS
mailing list by Doug Engert of Argonne Labs: "With the default
settings, we were seeing very high swap rates. The situation
went something like this. A number of processes are swapped out
and their priority is increasing. One is swapped in, and its new
priority is calculated, but it is less than others which are
still out, and thus it gets swapped out. We had to set the
min_inpri to much less than the min_outpri to slow this down.
But the intent of the min_inpri and min_outpri is to have a range
where the system can dynamically balance the swapping. A fudge
factor was needed." At CHPC, we were often experiencing similar
behavior with our job mix. Fortunately, Doug Engert went on to
determine a set of schedv parameters that alleviate the thrashing
problem somewhat. Unfortunately, sched has more fundamental
problems that can cause instabilities, not the least of which is
oscillatory behavior due to asymmetries introduced by two
separate functions for computing swap priorities. One alterna-
tive to the fudge factor Doug implemented is the concept of
thrash locks with terms proportional to process size, in addition
to a constant term. This method is used in COS and works quite
well for CHPC job distributions.
Other serious problems with sched are the following, first
described to us by Ted Kline of CRAY Research, during a two-day
UNICOS 5.0 memory scheduler workshop held at CHPC on August 21-
22, 1989 (during which time Ted installed a fix to roadhog
deadlock detection for no-hogmem systems). Suppose A and B are
in-memory processes with process C swapped-out and runable.
There exist conditions wherein sched will swap out A, even though
A will be ordered ahead of C on the swap-in queue, resulting in
immediately swapping A back in. This can continue until
priority(A) < priority(B), at which time similar oscillations can
begin with process B. If process C is ever ordered first on the
swap-in queue, of course it can swap in, but these "scheduler
loops" can go on for some time, wasting CPU and I/O resources.
Yet another problem Ted mentioned was that unless the swap-in
priority of a runable swapped-out process exceeds min_outpri,
sched refuses to swap it in until max_outage is exceeded, even
when adequate central memory is available!
While attempting to schedule processes strictly on the basis of
share priorities, which include costs for central memory and I/O
as well as CPU, we determined that the scheduler loop Ted
described occurs when max_inage is zero. Unfortunately, a nonzero
max_inage can result in keeping sleeping processes in central
memory when multiple runable processes are available on the
swap-in queue! Even though the fair share CPU scheduler func-
tions properly, in several situations the use of share priority
alone for memory scheduling is unstable for at least the follow-
ing reason. Suppose CPU-bound, roadhog processes A and B are the
only runable processes over some time interval. Also suppose
their share priorities are extremely close. Swap-in/swap-out
interchanges of A and B can occur at very short intervals. If
these swaps are at VHISP rates, lost CPU utilization might be
hardly noticed, but with disk swapping, very low CPU utilization
results. Again, thrash locks proportional to process size con-
stitute a component of the solution to this kind of problem.
Another inefficiency we have discovered in sched is the fol-
lowing. Once sched has ordered the swap-in queue and chosen the
best swap-in candidate, no prescan of the in-memory processes is
done to ensure that adequate central memory can be reclaimed to
swap it in: sched simply assumes it and begins sequentially swap-
ping out all processes of lower priority than the swap-in candi-
date. The result is unnecessary swapping and a lower degree of
multiprogramming, and hence lower CPU utilization.
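A prescan need not be elaborate. The following C sketch
(hypothetical names throughout) totals the memory reclaimable
from lower-priority in-memory processes before any swap-out is
issued; it ignores fragmentation, assuming memory can
subsequently be packed.

     /* Hypothetical prescan: before swapping anything out, verify
      * that enough central memory could be reclaimed from
      * in-memory processes of lower priority than the swap-in
      * candidate. Fragmentation is ignored; packing is assumed. */
     struct sk_resident { long swappri; long size; };

     int can_fit_candidate(const struct sk_resident inmem[], int n,
                           long cand_pri, long cand_size,
                           long freemem)
     {
         long reclaimable = freemem;
         int i;
         for (i = 0; i < n; i++)
             if (inmem[i].swappri < cand_pri)
                 reclaimable += inmem[i].size;
         return reclaimable >= cand_size;
     }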
Yet another recent discovery is as follows. Suppose sched
has chosen a large process A for swap-in and there exists process
B which is locked-in due to raw I/O. sched will not consider B
for swap-out and might swap out all other processes, including
those with higher priorities than B. After all processes other
than B have been swapped out, if there is inadequate contiguous
central memory to swap A in, sched sets a flag for bio to idle
down B. One oversight here is that B might be a roadhog and thus
must be swapped out for A to fit! Even worse, it can be the case
that B is the only process sched needed to swap out to fit A and
swapping out all other processes first was superfluous, not to
mention inefficient!
As a final example, there is packmem, the module called by
sched to pack central memory when no single gap is adequate to
swap in some process. The packmem algorithm does not pack memory
but rather scans processes in storage address order, attempting
to move each process into a gap adequate to contain its entire
process space, at a lower storage address. The concept of
storage-moving a process downward into its lower adjacent gap
(or upward into its upper adjacent gap), independent of that
gap's size, is simply not employed. At worst, packmem can
actually increase fragmentation; at best, it frequently performs
nothing but unnecessary storage moves! Perhaps the most
unfortunate side-effect of the packmem
algorithm is that swap-outs followed by swap-ins are sometimes
required to accomplish central memory packing!
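By contrast, true compaction slides each process against its
neighbor regardless of individual gap sizes, coalescing all free
space into a single region. The following C sketch illustrates
the idea under our own hypothetical representation of user
memory.

     /* Hypothetical compaction: slide each process, in ascending
      * storage address order, down against its predecessor so
      * that all free space coalesces into a single gap at the top
      * of user memory. Individual gap sizes never matter, unlike
      * packmem's whole-process-into-one-gap requirement. */
     struct sk_image { long base; long size; };

     void compact(struct sk_image procs[], int n, long user_base)
     {
         long next = user_base;
         int i;
         for (i = 0; i < n; i++) {    /* procs[] sorted by base */
             if (procs[i].base != next) {
                 /* a storage move of the process image to
                  * address 'next' would be issued here */
                 procs[i].base = next;
             }
             next += procs[i].size;
         }
     }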
There are other scenarios which result in sched instabili-
ties. Many of these have been discovered by Bill Jones of the
CHPC, using Ted Kline's schedvsim simulator. Ted provided our
systems programming staff with this valuable algorithm-modeling
tool during our August workshop.
2.2.2 NQS Scheduling Problems
The first and most important problem we have with NQS is the
lack of a direct coupling between NQS and the share scheduler. A
simple example of the problem follows. Suppose a site prefers to
schedule processes almost exclusively on the basis of share
priorities. Suppose user A, who has recently used far in excess
of his share of resources, submits "queue-limit" batch jobs via
NQS. Now assume that user B, who has no recent usage, submits a
batch job to the same queue. Since user A has used more than his
fair share of resources recently, the share scheduler is going to
allocate memory and CPU resources to A much less frequently than
to other available batch jobs with lower recent usage, thus stal-
ling A's progress through the system. Because A's stalled jobs
nevertheless occupy the queue's run slots, B's job waits behind
them: NQS is penalizing user B, the very user who deserves rapid
turnaround!
Another problem we see is the lack of a parameter to specify
filesystem resource limits. TMPDIR (and other) filesystem deple-
tion could be controlled if we could schedule batch jobs on the
basis of disk resources in addition to CPU, memory, and tape
requirements.
3. Conclusions and Suggestions
3.1 Four-Megaword Development and Testing
A primary reason for our early discovery of the sched
deadlocks and deficiencies related above is that such problems
are very likely to occur under rich production loads on modest-
resource machines. Sixteen- to sixty-four-megaword machines with
VHISP swapping rates to SSD memories may mask such problems much
longer than modest-resource machines. Yet four- and eight-
megaword machines still constitute the majority of the X-MP
installed base, a base wherein many sites plan to migrate to
UNICOS prior to purchasing additional hardware. Primarily for
these reasons, but also to increase developers' awareness of user
memory constraints on modest-resource machines, we suggest that
CRAY Research integrate modestly configured four- or eight-
megaword X-MP's into the UNICOS development and production
environments at Mendota Heights, so long as such machines main-
tain prominence in the distribution of CRAY Research's installed
base.
3.2 New Category-Based Integrated Scheduler
During the two-day workshop with Ted Kline of CRAY Research,
we presented some of our thoughts regarding an integrated job
scheduler for UNICOS, not completely unlike that of COS. We
believe the distribution of jobs on typical production supercom-
puters is too rich for the two simple categories of batch and
interactive. Furthermore, we believe that job scheduling should
be based on a combined "category-cost" concept, where batch job
cost might typically be the product of process size and resource
time remaining, while interactive job cost might be the product
of process size and resource time accumulated since the last
interaction. This tends to give modest-size and/or modest-
resource-time-limit batch jobs higher priority, while simultane-
ously ensuring that large, long-running batch jobs receive higher
priority as they near completion, allowing their resources to be
freed for other jobs. The cost function suggested for interac-
tive jobs gives truly interactive jobs good service, while the
"interactive grinders" receive lower priority as they compute
without interacting.
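Expressed in C, the two cost functions we have in mind are
trivial; lower cost means a stronger claim on memory, and all
names are our own illustration.

     /* Sketch of the category-cost idea; lower cost means a
      * stronger claim on memory. All names are illustrative. */
     long batch_cost(long size, long time_remaining)
     {
         /* modest jobs, and large jobs nearing completion, both
          * yield a low cost and hence good priority */
         return size * time_remaining;
     }

     long interactive_cost(long size, long time_since_interaction)
     {
         /* truly interactive jobs keep this small; "interactive
          * grinders" accumulate cost as they compute without
          * interacting */
         return size * time_since_interaction;
     }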
During the summary session of our workshop, Ted Kline indi-
cated that he was considering the concept of site-specifiable
categories with independent cost functions and central memory
maxima for each category. This would allow each site to define
its own process categories and associated cost functions. Dur-
ing a memory scheduler run, only runable processes would be
sorted on the basis of category and cost. Central memory would
then be packed in cost order within category memory cutoff lim-
its. To maximize the degree of multiprogramming, a final pass
would then be made with all category memory cutoffs set to
remaining user memory. This strategy must be augmented with
several complementary algorithms, such as swapping out long-term
sleeping processes immediately, at least under conditions of sub-
stantial central memory contention. This is an oversimplifica-
tion, of course, but contains the essence of a paradigm we would
like to see implemented.
3.3 Central Memory Packing
The UNICOS 5.0 process scheduler suffers from a lack of
emphasis on keeping central memory efficiently packed with as
many runable processes as possible at all times. This maximiza-
tion of the degree of multiprogramming is very critical in the
single CPU case, not to mention the case of multiple CPU's.
Perhaps sched should be willing to look beyond the front of the
swap-in queue when that process momentarily cannot fit. Several
smaller processes may fit at such times, resulting in better
resource utilization.
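Such a look-beyond policy could be as simple as the following C
sketch (hypothetical names): when the head of the swap-in queue
cannot fit, take the best-priority entry that can.

     /* Hypothetical look-beyond scan: queue[] is ordered best
      * priority first; return the index of the first entry whose
      * size fits the currently available gap, or -1 if none. */
     struct sk_swqent { long swappri; long size; };

     int first_fit(const struct sk_swqent queue[], int n, long gap)
     {
         int i;
         for (i = 0; i < n; i++)
             if (queue[i].size <= gap)
                 return i;
         return -1;
     }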
3.4 Size-Proportional Thrash Locks
We believe that provision of size-proportional thrash locks
is also extremely important, especially for machines with low-
bandwidth swapping devices.
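One possible form of such a lock is shown in the C sketch below;
the constant and per-megaword terms are purely illustrative.

     /* Hypothetical size-proportional thrash lock: after a
      * swap-in, hold the process in memory for a base interval
      * plus a term proportional to its size, so the transfer cost
      * of a large image is amortized before it can be swapped
      * again. */
     #define LOCK_BASE_SECS    2L  /* constant term (illustrative) */
     #define LOCK_SECS_PER_MW  4L  /* per-megaword (illustrative)  */

     long thrash_lock_secs(long size_megawords)
     {
         return LOCK_BASE_SECS + LOCK_SECS_PER_MW * size_megawords;
     }

     /* sched would then exclude from swap-out any in-memory
      * process whose time since swap-in is still below
      * thrash_lock_secs(size). */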
3.5 Properly-Documented Tuning Parameters
In an attempt to allow scheduling flexibility, various tun-
ing parameters were added to the UNICOS 5.0 process scheduler.
The definitions above are paraphrased from the UNICOS 5.0
schedv(1M) documentation. The parameters do not all function as
specified in that documentation. More importantly, these parameters
interact in some unexpected and unstable ways.
3.6 UNICOS Stability Requirement
We have made a long-term commitment to CRAY X-MP hardware
and UNICOS software, predicated on the quality of both. Quality
consists not only of "installability" but, more importantly, of
reliability, high performance, functionality, flexibility, and sta-
bility. If our user community is to migrate from our stable COS
X-MP/24 platform to the UNICOS X-MP EA/14se, we must have these
components of quality in all future releases of UNICOS.
4. Acknowledgements
I wish to acknowledge the help of Bill Jones and Dan Rey-
nolds of the CHPC in both discovering and solving problems in the
UNICOS 5.0 process scheduler. I wish to acknowledge Ted Kline of
CRAY Research for his help in discovering and solving problems in
the UNICOS 5.0 process scheduler, as well as his willingness to
listen to our thoughts regarding process scheduling in the
abstract. Finally, I wish to acknowledge Doug Engert, for his
discovery of a set of schedv parameters which partially offset
the design deficiencies present in the UNICOS 5.0 memory
scheduler and allow us to run production workloads for some job
distributions.
__________________________
UNIX is a registered trademark of AT&T Bell Labora-
tories. CRAY, COS, UNICOS and X-MP are registered
trademarks of CRAY Research, Inc. IBM and MVS are re-
gistered trademarks of International Business Machines.
VAX and VMS are registered trademarks of Digital Equip-
ment Corporation. CONVEX is a registered trademark of
Convex Computer Corporation. Sun is a registered
trademark of Sun Microsystems, Inc.