Implementing a multitasking OS on top of UNIX

Thu May 9 23:24:18 AEST 1991

In article <1991May9.055804.6550@casbah.acns.nwu.edu>
craig at casbah.acns.nwu.edu (Craig Robinson) writes:
>Just for my own edification though, what does happen at the CPU level when
>a process makes a system call?

It depends on the CPU.

The VAX has four `chm' instructions (chm[uesk]).  Unix uses only chmk,
`change mode to kernel'.  This is done as, e.g.:

	pushl	$1		# argument to syscall
	chmk	$1		# exit(1)

The chmk changes to kernel mode, sets the stack pointer to the Kernel
Stack Pointer `ksp' (it was the User Stack Pointer `usp'), and jumps to
the `chmk' vector.  (Actually it reads the chmk vector out of the scb,
and uses the low 2 bits to decide whether to use the kernel stack or the
interrupt stack, or to halt.)  The parameter to chmk is pushed on the
new stack after the previous psl and pc.  That is:

	*--ksp = psl;
	*--ksp = pc;
	*--ksp = <argument to chmk>;

The BSD vax kernel follows this with pushing the T_SYSCALL type (not
actually used, it just makes the trap() and syscall() frames the same)
and the usp, then calls syscall().  To return from a chm? instruction
you pop the chm argument (`tstl (sp)+' or `addl2 $4,sp') and execute an
`rei' (the semantics of rei are horribly complicated; see a VAX
architecture manual).  The BSD kernel takes advantage of the register
save mask to get the user's registers saved at entry to syscall()
itself.  (They are thus on the stack and can be modified and will be
reloaded on return automatically.)
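
For anyone who would rather read C than VAX assembly, here is a rough
model of the stack discipline just described.  It is only a sketch:
the struct, the function names, and the T_SYSCALL_DEMO value are all
made up for illustration (the real T_SYSCALL constant lives in the BSD
headers), and the real rei does a great deal more than pop two words.

```c
#include <assert.h>
#include <stdint.h>

#define KSTACK_WORDS 16
#define T_SYSCALL_DEMO 0x99u	/* placeholder, NOT the real BSD value */

/* Toy model: just enough state to show the kernel stack discipline. */
struct vax_model {
	uint32_t kstack[KSTACK_WORDS];
	uint32_t *ksp;		/* kernel stack pointer; grows downward */
	uint32_t usp, pc, psl;	/* user-mode state at the chmk */
};

void model_init(struct vax_model *m, uint32_t pc, uint32_t psl, uint32_t usp)
{
	m->ksp = m->kstack + KSTACK_WORDS;	/* empty stack */
	m->pc = pc;
	m->psl = psl;
	m->usp = usp;
}

static void push(struct vax_model *m, uint32_t v)
{
	assert(m->ksp > m->kstack);
	*--m->ksp = v;
}

static uint32_t pop(struct vax_model *m)
{
	assert(m->ksp < m->kstack + KSTACK_WORDS);
	return *m->ksp++;
}

/* The three pushes done by the chmk instruction itself. */
void model_chmk(struct vax_model *m, uint32_t code)
{
	push(m, m->psl);
	push(m, m->pc);
	push(m, code);
}

/* What the BSD entry code adds before calling syscall(). */
void model_bsd_entry(struct vax_model *m)
{
	push(m, T_SYSCALL_DEMO);
	push(m, m->usp);
}

/* Return path: undo the kernel's pushes, drop the chmk code, "rei". */
void model_return(struct vax_model *m)
{
	m->usp = pop(m);	/* saved usp */
	(void)pop(m);		/* trap type */
	(void)pop(m);		/* chmk argument: the `addl2 $4,sp' step */
	m->pc = pop(m);		/* rei pops the pc ... */
	m->psl = pop(m);	/* ... then the psl */
}
```

Calling model_chmk(), model_bsd_entry(), and model_return() in that
order leaves the model's kernel stack empty again, which is the
property the real entry and exit code has to preserve.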

The Tahoe has a `kcall' instruction.  It works a lot like the VAX chmk.

The 680x0 has 16 `trap' instructions.  (Well, actually one, with a
parameter in the range 0..15.)  The OS author decides to use one for
system calls, and chooses how to encode the calls.  SunOS, HPUX, and
Utah's HP-BSD all use trap 0 and put the system call number in d0 (with
the rest of the parameters on the stack).  The trap switches to kernel
mode (thus getting the kernel stack pointer), pushes the pc (4 bytes)
and the sr (2 bytes), and jumps through the trap vector.  The BSD
trap-0 vector clears another 2 bytes to realign the stack (important on
the 680[234]0 for performance), then pushes all the user's registers
with `moveml #0xffff,sp@-'.  The BSD kernel then saves the user SP (the
moveml pushed the kernel stack pointer) and pushes the system call
number again (!) and calls syscall().  It then pops the system call
number, reloads the user stack pointer, does a `moveml sp@+,#0x7fff' to
recover everything except the user stack pointer, adds 6 to sp to pop
the usp and the alignment word, and then jumps to a routine that fakes
a VAX `rei' (checking for pseudo ASTs: rather silly but a bit difficult
to clean up).
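
The byte counting in that sequence is easy to get wrong, so here is a
small C sketch that just does the arithmetic.  It models nothing but
stack depth, using the sizes given above; the function names are
invented for this example and do not appear in any kernel.

```c
#include <assert.h>

/* Count the registers named by a moveml mask (one bit per register). */
int regs_in_mask(unsigned mask)
{
	int n = 0;

	for (mask &= 0xffffu; mask != 0; mask >>= 1)
		n += (int)(mask & 1u);
	return n;
}

/* Bytes on the kernel stack at the point syscall() is called. */
int m68k_entry_bytes(void)
{
	int bytes = 0;

	bytes += 4;				/* hardware: pc */
	bytes += 2;				/* hardware: sr */
	bytes += 2;				/* BSD: alignment word */
	bytes += 4 * regs_in_mask(0xffff);	/* moveml #0xffff,sp@- */
	bytes += 4;				/* system call number */
	return bytes;
}

/* Bytes left once the return path has run, just before the fake rei. */
int m68k_exit_bytes(int bytes)
{
	bytes -= 4;				/* pop system call number */
	bytes -= 4 * regs_in_mask(0x7fff);	/* moveml sp@+,#0x7fff */
	bytes -= 6;				/* usp slot + alignment word */
	assert(bytes >= 0);
	return bytes;
}
```

The 0xffff mask names all 16 registers and 0x7fff names 15, so the
restore leaves one 4-byte slot (the usp) behind; that slot plus the
2-byte alignment word is the 6 that gets added to sp.  What remains is
the 6 bytes of pc and sr that the trap itself pushed.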

The SPARC has the `t' (trap) instruction, which has a 7-bit parameter
for (in effect) 128 trap instructions.  The OS author decides to use
one for system calls, and chooses how to encode the calls.  SunOS and
BSD both use software trap 0 and put the system call number in %g1,
with the rest of the parameters in the trapping routine's %o0..%o5.
(For the indir system call, the 7th parameter is found on the routine's
stack.  The usual case involves no memory traffic, however.)  All SPARC
traps, including interrupts, work the same way:  They decrement the
current window pointer in the psr, write the pc and npc into what are
now %l1 and %l2, copy the psr `S' (supervisor) bit into the `PS'
(previous supervisor) bit, clear the `ET' (enable traps) bit, and set
pc and npc (npc is the `next' pc, exposed due to delayed branches)
to the trap base register plus 16 times the trap vector index.
(Software traps start at 128 and go to 255.  Hardware traps use the
range 0..127.)  That is all that the hardware does: it does not set up
a kernel stack pointer, or save things to a stack.  The rest is up to
software.  My kernel:

      -	branches to syscall and saves %psr in %l0 in the delay slot;

      -	at syscall, invokes a hairy macro called TRAP_SETUP:
		if (trap came from kernel mode) { // i.e., psr<PS> is set
			if (we are in the trap window)
				save the trap window somewhere;
			%sp = %fp - stackspace; // here stackspace=80
		} else {
			compute the number of user windows;
			if (we are in the trap window)
				save the trap window somewhere;
			%sp = (top of kernel stack) - stackspace;
		}
	where the number of user windows is:
		cpcb->pcb_uw = (cpcb->pcb_wim - 1 - CWP) % nwindows
	which is computed via table lookup (the pcb_wim field is
	maintained by software; it is simply log2(%wim));

      -	enables traps, stores the saved psr (%l0), pc (%l1), npc (%l2),
	%y (read into %l3), the values of %g1 through %g7 and the
	caller's %o0 through %o7 (now our %i0..%i7) into the 80 bytes
	reserved above;

      -	calls syscall(), passing the address of the stuff just built
	on the kernel stack.
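
Two pieces of arithmetic from the SPARC description are worth spelling
out in C.  This is a sketch with invented function names, not code
from my kernel (which does the window count by table lookup).  It
assumes trap_base is the 4K-aligned base address held in the trap base
register, and it writes the modulus so that the result is never
negative, which a plain C `%' would not guarantee for the formula as
written above.

```c
#include <assert.h>
#include <stdint.h>

/* Software trap n (the `t' instruction) uses trap types 128..255. */
unsigned sparc_software_trap_type(unsigned n)
{
	assert(n < 128);
	return 128u + n;
}

/* Where the hardware sets pc: trap base plus 16 bytes per trap type. */
uint32_t sparc_trap_target(uint32_t trap_base, unsigned trap_type)
{
	assert(trap_type < 256);
	return trap_base + 16u * trap_type;
}

/*
 * Number of user windows: (pcb_wim - 1 - CWP) mod nwindows, where
 * pcb_wim is log2(%wim) and cwp is the window the trap put us in.
 * Adding 2*nwindows first keeps the unsigned arithmetic from wrapping.
 */
unsigned user_windows(unsigned pcb_wim, unsigned cwp, unsigned nwindows)
{
	assert(nwindows > 0 && pcb_wim < nwindows && cwp < nwindows);
	return (pcb_wim + 2u * nwindows - 1u - cwp) % nwindows;
}
```

For example, with 8 windows, the trap in window 3, and window 6 marked
invalid, windows 4 and 5 belong to the user, and user_windows(6, 3, 8)
gives 2.  With the trap in window 6 and window 1 invalid, the user
windows are 7 and 0, and the count is again 2.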

Note that it is possible, but wrong, to get a kernel mode system call.
Therefore, part of the work in TRAP_SETUP could be dispensed with, but
for the fact that the trap window must be saved anyway, even if we are
just going to panic.  Since the delay slot for that test is filled for
both cases, this is only a single instruction; the loss is minor.
The `save the trap window somewhere' is complicated but is done as a
subroutine (with linkage being stored in %l4, leaving only 3 registers
free in the save code in some cases, but that turns out to be *just*
enough).
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek at ee.lbl.gov