Spinlocks and Semaphores Revisited
Jim Barton
jmb at patton.sgi.com
Wed Sep 13 08:14:28 AEST 1989
It was pointed out to me that I posted the section of the software spec
with the "Company Confidential" still on it. Guess I'll be in engineering
a long time - I haven't been able to train myself properly about such things.
In any case, here is a distributable version with the "Confidential" removed,
so you can copy it, give it away, etc., but try to maintain it in the state
it is.
-- Jim Barton
Silicon Graphics Computer Systems "UNIX: Live Free Or Die!"
jmb at sgi.sgi.com, sgi!jmb at decwrl.dec.com, ...{decwrl,sun}!sgi!jmb
---------- cut here -----------
Section 2 - 1 -
Section_2:_Software_Synchronization
James M. Barton
Silicon Graphics Computer Systems
Copyright (c) 1989 Silicon Graphics, Inc.
All Rights Reserved
($Revision: 1.12 $ $Date: 86/09/23 17:27:33 $)
1. UNIX_On_Multiprocessors
Much of the discussion in this section follows that in [1].
This section assumes that the Clover 2 multiprocessor
structure will consist of two or more CPU's that share
common memory and peripherals. Such a structure can
potentially provide much greater throughput since processes
can run concurrently on different processors. Each CPU
executes independently, but all of them execute a single
copy of the kernel. In general, processes are unaware that
they are running on a multiprocessor, and can migrate
between processors transparently. Unfortunately, a process
does not consume less CPU time.
Allowing several processors to execute simultaneously in the
kernel causes integrity problems unless protection
mechanisms are used. The original design of UNIX assumed
that only a single processor would ever execute the kernel.
Thus, kernel data structures could be protected by not
preempting a process in the kernel. A kernel process gives
up control of the processor voluntarily, through the normal
UNIX sleep/wakeup mechanism.
There are three ways in which corruption of kernel data
structures can be prevented on a multiprocessor:
1. Execute the kernel on one processor only, which
protects all kernel data structures but serializes all
system functions.
2. Serialize small sections of code (critical regions)
with locking primitives. This approach adds locking
overhead, and may increase context switching.
3. Redesign algorithms to avoid contention for data
structures. In many cases, this may be impossible to
do and still maintain reasonable performance.
AT&T currently sells a multiprocessor version of UNIX for
the 3B20A attached processor system. In this UNIX kernel,
locking primitives are used to create short critical
sections where contention for kernel data structures may
occur. This approach was chosen because, if properly
designed and implemented, the problems of such locking can
be managed easily without sacrificing system throughput or
causing major redesigns of kernel algorithms. By partially
September 12, 1989 Silicon Graphics Inc.
Section 2 - 2 -
porting and otherwise using this code as an example, it will
be possible to implement a multiprocessor kernel in a short
period of time that meets the response and throughput
requirements of Clover 2.
The remainder of this section is devoted to describing the
locking primitives, how they are used in general and how
they will be used in the Clover 2 kernel.
2. Semaphores
2.1 Spinlocks
2.1.1 General_Description The class of locking primitives
used in the UNIX kernel goes under the general name of
semaphores. The simplest semaphore is a spinlock. To
access some critical resource, a processor attempts to
acquire the lock using the simple algorithm:
loop:
if lock is set goto loop;
set lock;
The lock is obviously released simply by clearing it, which
will allow at most one other processor to obtain it.
If the testing and setting of the lock are not atomic, then
it is possible that two processors may believe they have
obtained the lock at the same time, or that neither has
obtained the lock. The first case can result in corrupted
data structures, while the second results in deadlock.
Spinlocks can be implemented atomically in software (see
[1]), however this solution is slow and may potentially
waste a substantial number of cycles. Instead, most modern
computers provide atomic test-and-set instructions, which
allow a processor to obtain the lock in a single operation.
2.1.2 Spinlock_Usage Spinlocks are used whenever the time
in which the lock is actually held by any one processor is
very short. In these cases it is more efficient simply to
wait for the lock rather than attempting to find some other
useful work to do; the time spent searching for other work
may exceed the time spent waiting for the lock.
In a multiprocessor system, it is difficult to know when a
spinlock may be more efficient than a mutual exclusion
semaphore. The use of spinlocks must be empirically
determined during system tuning, when the effects can be
measured. Any very short operation is a candidate for spin
September 12, 1989 Silicon Graphics Inc.
Section 2 - 3 -
locking.
Example uses of sinplocks are for controlling free list
access or performing linear scans of system tables.
2.1.3 Spinlocks_and_Interrupts One important aspect of
locking that becomes apparent in any interrupt-driven system
is the possibility that an asynchronous routine started by
the hardware may take control of the CPU while it is
attempting to obtain the lock. If the interrupt occurs
after the lock has been obtained, then the time for which
the lock is held can increase dramatically, seriously
impacting system performance.
If a spinlock is used to protect a resource to which an
interrupt routine needs access, then similar problems arise.
The interrupt routine remains active while spinning,
blocking other important interrupt processing and degrading
system performance.
Finally, the consequence if an interrupt routine attempts to
obtain the same lock that normal processing has just
obtained is guaranteed deadlock of the system.
To avoid all these problems, interrupts are disabled while
attempting to obtain and while holding a spinlock. This
re-enforces the necessity for spinlocks to be held as short
a time as possible.
2.1.4 Hardware_Support The Clover 2 hardware will supply a
large set of "virtual" locks, which will provide atomic
test-and-set capabilities between the processors in the
system. This provides a powerful mechanism for implementing
spinlocks, since traffic on the multiprocessor bus can be
avoided. The importance of having a very large set of these
locks will become apparent in the following sections.
2.2 Mutual_Exclusion_Semaphores
Once spinlocks are available, it becomes possible to
implement more sophisticated types of semaphores. Once of
these is the mutual exclusion semaphore, which is used to
protect critical sections of code in a more intelligent
manner than a spinlock. Such a semaphore is a data
structure which consists of a counter and a queue of
processes. The counter is initialized to the value one.
There are two operations possible on such a semaphore,
allocation, often called the acquire operation, and de-
allocation and wakeup, often called the release operation.
A process attempting to acquire the semaphore does so using
September 12, 1989 Silicon Graphics Inc.
Section 2 - 4 -
the following (abbreviated) algorithm:
if (counter <= 0) then
add this process to end of queue of processes;
block this process (allow another to run);
on return, decrement the counter;
continue with critical section;
else
decrement counter;
continue with critical section;
fi
The release operation is performed in the following manner:
increment counter;
if (a process is waiting on the queue) {
decrement counter;
unblock the first process on the queue;
}
Obviously the above operations must be atomic in nature; a
spinlock is used to protect the semaphore, and is locked
before and unlocked after each of the above operations.
2.2.1 Usage Mutual exclusion semaphores are used where the
length of the critical section to be protected is long or
variable in length. This allows other processes to run and
do useful work even though certain work must wait.
Examples of such usage are when holding region entries for
managing segments of virtual memory, or when using an inode
to access disk files or devices.
2.2.2 Counting_Semaphores The mutual exclusion semaphore
is a special case of a general counting semaphore. A
counting semaphore is initialized to a value greater than
one, allowing more than one process to acquire the semaphore
at once. Counting semaphores are often used in resource
allocation, where there is a fixed number of resources
available. Another use is to provide atomic counters; the
semaphore is released each time the counter should be
incremented, but never acquired.
The AT&T multiprocessor code uses many atomic counters but
no counting semaphores; the Clover 2 product may include the
use of counting semaphores for controlling certain
resources.
September 12, 1989 Silicon Graphics Inc.
Section 2 - 5 -
2.3 Synchronization_Semaphores
This third type of semaphore is used to synchronize
different processes (possibly on different processors). A
synchronization semaphore is actually a mutual exclusion
semaphore, however the acquire and release operations are
performed by different processes, affecting a rendezvous
operation.
Example usage is in awakening the paging daemon from the
clock handler when memory is in short supply and a
sufficient time interval has elapsed.
3. Multiprocessor_Semaphore_Primitives
The AT&T 3B20A kernel introduced several new primitives to
the kernel to provide semaphore facilities. One of the
stated goals of the implementation was to provide a kernel
that could run on either uniprocessors or multiprocessors
with no modification to the kernel code, and to achieve this
goal with only a minor decrease in uniprocessor performance.
This goal was met.
Each set of semaphore primitives is described below.
Spinlock Primitives
lock_t lock;
spsema(&lock); - acquire the lock, spin until acquired
svsema(&lock); - release the lock
On a uniprocessor the spinlock primitives are commented out
using the C preprocessor (since spinlocks are meaningless on
a uniprocessor):
# define spsema
# define svsema
The actual implementation of the spinlock is done using a
3B20 microcoded instruction for speed. On the MIPS
processor, there is no test-and-set instruction. Thus,
hardware support must be provided to insure that a high-
performance path exists for performing this operation. The
Clover 2 system will provide a separate, high-speed bus for
synchronization among the processors in the complex. This
bus will provide similar functionality to the SLIC VLSI chip
done for the Sequent Balance multi-processor [2] The Sequent
implementation provides 64 test-and-set variables which are
global to all processors. In addition, a number of 32-bit
September 12, 1989 Silicon Graphics Inc.
Section 2 - 6 -
registers were available that could be transferred between
processors in an atomic fashion, provided a general message
facility. This facility was used for distributing
interrupts from I/O devices and communication between the
processors.
The Clover 2 implementation will be based on cheaper gate-
array technology, and will be called the SYNC bus for
brevity. Access to the SYNC bus will be through normal load
and store operations in the physical address space of each
processor. The SYNC bus will provide 4096 test-and-set
locks. Each lock will be atomic in nature, thus a processor
can use these locks for access control, busy waiting, or
other uses. The locks will be grouped into 128 pages, each
having 32 locks. This allows a group of locks to be mapped
into the virtual address space of the processor, allowing
user access to certain locks if so desired. Other
facilities will be provided by the SYNC bus; they are not of
interest here 1.
Unlike the Sequent system, such a large number of locks
allows each semaphore in the kernel to have a private lock,
improving the overall performance of the system. Since no
memory cycles are used once the busy-wait instruction has
been loaded from memory, all memory traffic is eliminated.
With high-speed spinlocks available, the semaphore
operations can be defined.
Synchronization Primitives
sema_t sema;
psema(&sema, PRI); - acquire the semaphore
vsema(&sema); - release the semaphore
cpsema(&sema, PRI); - conditionally acquire the semaphore
Algorithmically, psema() is implemented as:
__________
1. The SYNC bus will provide 4 8-bit registers that can be
incremented, decremented or set in an atomic fashion.
This allows certain algorithms to be speeded up. In
addition, the SYNC bus will be used for messaging
interrupts both from the I/O system and between
processors, eliminating this load on the system memory
bus.
September 12, 1989 Silicon Graphics Inc.
Section 2 - 7 -
BEGIN
acquire semaphore lock;
if (count <= 0) {
decrement count;
acquire semaphore queue lock;
release semaphore lock;
add process to semaphore queue;
release semaphore queue lock;
switch to another process;
on return, return to caller;
}
else
decrement count;
release semaphore lock;
return to caller;
END
The routine vsema() is implemented as:
BEGIN
acquire semaphore lock;
increment count;
if (count > 0) {
acquire semaphore queue lock;
if (a process is queued) {
remove process from queue;
unblock process;
decrement count;
}
release semaphore lock;
release semaphore queue lock;
return;
}
release semaphore lock;
END
The routine cpsema() is used by a process which does not
wish to block if the semaphore cannot be acquired. This is
used in two instances. First, if the caller is an interrupt
routine wishing access to a data structure, than it may not
block. Thus, it uses cpsema() to acquire the semaphore, and
takes some other action if the critical section is in use.
Second, some algorithms require that multiple semaphores be
used when updating related data structures. In this case,
it is necessary to avoid blocking so that the system does
not deadlock. This type of algorithm usually appears as:
September 12, 1989 Silicon Graphics Inc.
Section 2 - 8 -
loop:
psema(&sem1);
.
.
.
if (cpsema(&sem2)) {
vsema(&sem1);
goto loop;
}
.
.
.
vsema(&sem2);
vsema(&sem1);
Cpsema() is also used in cases where semaphores must be
acquired in a different order than they will be released.
This is also a case where only a conditional operation will
avoid deadlock. Cpsema() is implemented as:
BEGIN
acquire semaphore lock;
if (count <= 0) {
release semaphore lock;
return failure;
}
else
decrement count;
release semaphore lock;
Mutual Exclusion Primitives
sema_t sema;
appsema(&sema, PRI); - acquire the semaphore
apvsema(&sema, PRI); - release the semaphore
apcpsema(&sema); - conditionally acquire semaphore
These primitives are implemented differently depending on
whether the target system is a uniprocessor or a
multiprocessor. The following C preprocessor definitions
control this:
September 12, 1989 Silicon Graphics Inc.
Section 2 - 9 -
# ifdef multiprocessor
# define appsema psema
# define apvsema vsema
# define apcpsema cpsema
# else
# define appsema
# define apvsema
# define apcpsema
# endif
Thus, on a uniprocessor, these operations are not performed,
since they would only add overhead and perform no useful
purpose. On a multiprocessor, these operations are
equivalent to the previously defined psema() and vsema()
operations.
Miscellaneous Primitives
sema_t sema;
initsema(&sema, value); - initialize the given semaphore
valusema(&sema) - return the value of the semaphore
4. Uniprocessor_UNIX_Semaphore_Usage
Although not explicitly shown in the code, normal UNIX uses
several implicit semaphores to manage kernel data space in a
crude manner.
First, a single semaphore is used to protect all data
structures from access by other processes: the mode, e.g.,
kernel or user. When in kernel mode, the current process
may not be preempted, protecting kernel data structures.
This is obviously a crude method of protection, but has the
advantage of elegance and simplicity. This semaphore is a
mutual exclusion semaphore.
Second, the interrupt level of the processor is used to
protect from access by interrupt routines. There are as
many of these semaphores as there are interrupt levels in
the processor. The kernel routines used to acquire and
release these semaphores are the spl routines. These
semaphores are spinlocks, since the interrupt is blocked by
the processor only for short periods of time, and other
processing (except higher priority interrupts) cannot
continue while the lock is held.
Third, process synchronization is performed using the sleep
and wakeup primitives, which provide an inefficient yet
elegant method for synchronization. The usual sequence for
September 12, 1989 Silicon Graphics Inc.
Section 2 - 10 -
acquiring the semaphore is:
while (flag & BUSY) {
flag |= WANT;
sleep(&flag);
}
flag |= BUSY;
The sequence for releasing the semaphore is:
flag &= ~BUSY;
if (flag & WANT) {
flag &= ~WANT;
wakeup(&flag);
}
Thus, if more than one process is waiting on the semaphore,
they will all awake at once, and must contend for the
semaphore once again.
5. Multiprocessor_UNIX_Semaphore_Usage
A multiprocessor UNIX kernel must abandon the mode
semaphore, since multiple processors may be executing in the
kernel at once. The sleep and wakeup primitives are
replaced with synchronization semaphores. This provides a
more efficient implementation, since only the first process
waiting on the semaphore will awake, and that process will
hold the semaphore once it does.
Even through the multiprocessor UNIX kernel abandons the
mode semaphore, it still disallows preemption of kernel
processes on the same processor. This is for performance
reasons. For example, consider the case where a kernel
process is preempted, but the next process chosen to run
needs access to the same resource. Eventually control will
pass back to the preempted process, but only after a
substantial amount of time and several unnecessary context
switches.
At points where the uniprocessor kernel used interrupt
blocking to protect from interrupt routines, the
multiprocessor kernel must also provide spinlocks to protect
from other processors. Thus, a common sequence in the
multiprocessor kernel is:
September 12, 1989 Silicon Graphics Inc.
Section 2 - 11 -
x = spl6();
spsema(&lock); /* spsema() is discussed below */
.
.
.
svsema(&lock); /* svsema is discussed below */
splx(x);
The most difficult change is the use of mutual exclusion
semaphores to protect kernel data structures that previously
were protected by the mode semaphore. Thus, areas such as
the buffer cache, process management and virtual memory
management must be modified to use semaphores to insure
exclusive access to structures.
5.1 Semaphores_and_Signals
One aspect of the normal sleep/wakeup mechanism is that
signal handling may be performed within these primitives if
desired by the caller. This allows long waits to be
interrupted by the user or application and proper cleanup to
be performed.
Thus, both mutual exclusion and synchronization semaphores
provide the means by which a signal can be ignored or
handled, in the same manner as that used by sleep/wakeup.
An example call to a mutual exclusion semaphore is:
psema(&sema, PPRI);
psema(&sema, PPRI | PCATCH);
In the first case, if the priority PPRI is greater than
PZERO, than a signal which is sent to the process will cause
the process to wake up, cleanup it's entry on the semaphore
queue, and use longjmp() to return to the system call
handler for return to the user. If the priority is less
than PZERO, than such signals will always be ignored, and
the process can only be woken by:
vsema(&sema);
In the second case, the PCATCH flag causes control to be
returned to the caller when a signal occurs (if the priority
is greater than PZERO, as before). Thus, the caller can
take any necessary cleanup actions before returning to the
application.
September 12, 1989 Silicon Graphics Inc.
Section 2 - 12 -
6. Semaphore_Performance
In a semaphored system, there are a number of different
factors that effect the performance of the system.
Obviously, the more semaphore operations performed, the
fewer cycles that will be devoted to real work. The speed
with which a semaphore can be accessed and updated is also
critical. AT&T dealt with this problem by micro-coding
special instructions to handle the most common semaphore
operations.
There is one factor, however, that has an overriding effect
on performance. This is the hit ratio on a semaphore. This
is defined as the ratio of the number of times a semaphore
could be acquired immediately to the number of attempts to
acquire the semaphore. By examining the above algorithms,
the reader can see that the optimal path is to immediately
acquire the semaphore - blocking a process is an expensive
and time-consuming operation.
During construction of the multiprocessor kernel, AT&T
discovered that a hit ratio of at least 95% on all
semaphores was necessary to achieve adequate performance of
the system. To achieve this goal, it is often necessary to
break up data structure accesses into smaller chunks, and to
isolate accesses as much as possible. For instance, the
buffer cache is broken up into a large set of hash buckets
that can be searched individually, reducing the contention
on any one semaphore.
Fortunately, AT&T has already done most of this work,
providing the Clover 2 effort with high performance debugged
algorithms for handling most of these cases.
7. Driver_Semaphore_Use
Clover 2 will obtain most of it's UNIX drivers from the
Clover I system. Converting drivers to semaphore use would
be a long, costly and painful process, perhaps not worth the
effort. To avoid the same situation for the 3B20A port,
AT&T added the concept of driver semaphores to the kernel.
Since a driver can only be entered in certain well-defined
ways, it was possible to place a semaphore sequence around
each entry to the driver. In general, these entry points
are the open, close, read, write and ioctl entries. This is
much like the mode semaphore in normal UNIX, in that it
prevents any other process from entering the driver
simultaneously. Three separate types of protection are
provided.
September 12, 1989 Silicon Graphics Inc.
Section 2 - 13 -
First, the entire driver can be semaphored on it's major
device number. Before the kernel invokes the driver, it
calls psema() to lock the driver semaphores, and after
return from the driver it calls vsema() to unlock the
semaphore.
Second, each minor device number access to the driver can be
semaphored. Instead of locking a single semaphore, each
possible minor device number has a semaphore associated with
it. The kernel locks the semaphore associated with the
minor device before entering the driver and unlocks it after
return. For example, this used in protecting the tty
driver, which maintains independent structures for each
minor device supported.
Third, a driver can be given no protection, in which case it
is responsible for protecting it's own data structures.
This means that structures within the driver that are shared
between processes or with interrupt routines must be
protected with semaphores.
To protect from interrupt access while a driver semaphore is
locked, the kernel checks the driver semaphore before
launching the interrupt routine. If the semaphore is
locked, the kernel queues the interrupt and returns. The
code which unlocks the driver semaphore checks for the
queued interrupt, and if one exists it launches the
interrupt routine at that time. Note that this behavior is
only possible with major device locking. A driver that uses
minor device locking must be modified to block the interrupt
routine explicitly (if it does not already do so).
If a driver choses to access kernel data structures, it will
not be possible to port it directly to the multiprocessor
environment. This is because it will not follow the
kernel's locking strategy for the structures accessed. Such
drivers must be modified to use the proper locking strategy
before they can be used in the multiprocessor kernel.
The Clover 2 kernel will implement the 3B20A driver concept,
maximizing the use of currently existing drivers. In most
cases, it should be possible to directly use drivers written
for Clover I. It is expected that some drivers will require
a porting effort.
September 12, 1989 Silicon Graphics Inc.
Section 2 - 14 -
8. Further_Reading
Much of this discussion is based on the actual code for the
AT&T 3B20A implementation, as well as [3] and [1].
September 12, 1989 Silicon Graphics Inc.
Section 2 - 15 -
REFERENCES
1. Bach, Maurice J., The Design of the UNIX Operating
System, Prentice-Hall, Englewood Cliffs, N. J., 1986.
2. Beck, B., and Kasten, B., VLSI Assist in Building a
Multiprocessor UNIX System, Sequent Computer Systems,
Portland Oregon, not dated.
3. Bach, M. J., and Buroff, S. J., Multiprocessor UNIX
Operating Systems, AT&T Bell Laboratories Technical
Journal, Vol. 63, No. 8, October 1984.
September 12, 1989 Silicon Graphics Inc.
More information about the Comp.sys.sgi
mailing list