number of nfsd processes to start (long)

Piercarlo Grandi pcg at compsci.aberystwyth.ac.uk
Wed Jan 10 08:22:56 AEST 1990


In article <4116 at brazos.Rice.edu> monty at delphi.bsd.uchicago.edu (Monty Mullig) writes:

   The man page for nfsd suggests that "four is a good number" of nfsd
   processes to start, but doesn't give any further information on how to
   choose the best number.

The answer will be long... It also discusses how to improve performance in
general.

In blue-sky theory, the number of nfsd processes should equal the number of
discs you have plus two: this covers the case where all the discs are busy,
one nfsd is working the ethernet (if you have more than one ethernet
interface, more than one nfsd could be reading or writing to the ethernet),
and one nfsd is running on the CPU as well.
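
For instance, on a hypothetical server with two discs and one ethernet
interface that would be 2 + 1 + 1 = 4 nfsds; the line that starts the
daemons in your rc files (on SunOS usually /etc/rc.local) would then look
something like:

		# 2 discs + 1 nfsd for the ethernet + 1 for the CPU = 4
		nfsd 4 &	echo -n ' nfsd'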

In practice, this may well be too optimistic, because the number of
concurrently busy nfsds will virtually never reach its theoretical maximum.

If you have fewer nfsd processes, especially under SunOS 4, the number of
context switches will go down dramatically, above all if your server is
used mostly for NFS service. Another way to get context switches down is
conceivably to tune the NFS request sizes a bit, but here I'd like to hear
somebody else's experiences. Remember also that nfsds are activated FIFO,
not LIFO, which means that they will *all* be active, and thus each of
them will, even with low traffic, consume one of the precious Sun MMU
context slots (one of the reasons why nfsd should be multithreaded, not
multiforked).
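
If you do want to experiment with request sizes, the knobs are the usual
rsize/wsize NFS mount options on the clients; purely as a sketch ('server'
stands for your file server, and the 2 kbyte figure is a starting point for
experiments, not a recommendation):

		# in a client's /etc/fstab: 2 kbyte NFS requests
		# instead of the default 8 kbytes
		server:/usr  /usr  nfs  ro,hard,intr,rsize=2048,wsize=2048  0 0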

    We have 10 diskless clients and about 20 PCs running off of a
    4/280 and our diskless 3/50s have recently begun to run noticeably
    slower (we just added 4).

This is quite a load, especially on the ethernet interface. Tree-like
communication patterns on Ethernet are known to be *bad*.

   Should we increase the nfsd processes? 

Definitely *not*. You don't add any processing power by adding new
daemons; you only add opportunity for interleaving, and that cannot exceed
the number of devices that can be busy at one time.

On the contrary, context switching is likely to become more frequent, and
remember also that the MMU context cache on a Sun has a fixed number of
slots, and having more active processes than slots (typically 8 or 16; the
4/280 may have 32) is *A BAD IDEA*.

       our 4/280 has one controller on each drive, with the client swap
       partitions on a rimfire controller and a hitachi 892 MB drive.
       we have 32 MB on the 4/280.

What you should do is increase the buffer cache size on the clients (say
to 20-25% of memory) by patching 'bufpages' (the number of 8 kbyte buffer
slots) and 'nbuf' (the number of buffer headers, which should be no smaller
than about four times 'bufpages'); increase it on the server (say to
40-50% of memory); and balance the swap and root partitions across the two
drives.

	you may want to do the following:

		# 4 Meg (512 pages) Sun3/50 workstation
		adb -w /vmunix <<@
		bufpages?W	0t100	# 20-25%, cache locally
		nbuf?W 		0t600	# small dirs/files cached
		@

		# 32 Meg (4096 pages) Sun4/280 file server
		adb -w /vmunix <<@
		bufpages?W 	0t1200	# 30-35%, pointless more than 10 megs
		nbuf?W 		0t5000	# probably larger files cached
		@
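
The patched values take effect when you next boot from the patched
/vmunix; to read them back from the running kernel afterwards you can do
something like (the usual adb kernel incantation; the core device name may
differ on your release):

		echo 'bufpages/D' | adb -k /vmunix /dev/kmem
		echo 'nbuf/D' | adb -k /vmunix /dev/kmem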

You may want to distribute your filesystems as follows:

	drive 1: root (including /var, /private) plus /usr (/share, ...)
	drive 2: swap plus /users

You often want to have half of the clients' roots on the first disc and
their swaps on the second, and the other half the other way around. It may
also be a good idea (but probably only marginally so) to duplicate the
read-only filesystems on both drives, and have half of the clients use one
copy and half the other. For example, assuming your clients are split into
sets A and B:

	drive 1:	server root+/private
			set A roots
			set A shared (used also by server)
			set B swaps
			set B homes

	drive 2:	server swap
			set B roots
			set B shared (a copy of set A's)
			set A swaps
			set A homes

Some extra care can be taken, for example, to ensure that the high-traffic
filesystems, be they usually the shared binaries and libraries or the user
filesystems, sit in the middle of the discs, to minimize expected arm
motion (the layout above reflects this). Each user should be notionally
assigned a most frequently used workstation, and that user's home
directory placed in the same set as the workstation.
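
As a sketch of how this might map onto actual partitions (partition
letters and mount point names below are only placeholders, not a
recommendation for your particular controllers):

	drive 1:	a  /			server root + /private
			d  /export/root.a	set A client roots
			e  /usr.a		shared binaries, set A + server
			f  /export/swap.b	set B client swaps
			g  /home.b		set B home directories

	drive 2:	b  (server swap)
			d  /export/root.b	set B client roots
			e  /usr.b		copy of /usr.a, for set B
			f  /export/swap.a	set A client swaps
			g  /home.a		set A home directories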

You should *very* seriously consider taking both discs off your 4/280 and
putting each on a smaller machine (each with a dumping device, e.g. an
Exabyte). Unless you configure a lot of nfsd processes that chew up
context switch time, an NFS server is strongly I/O bound, in both ethernet
board and disc bandwidth; CPU speed and main memory size almost don't
matter.

By having the two discs on two different machines you put each disc behind
its own ethernet interface, you get a two-rooted communication pattern,
and each disc is served by a whole machine. If you follow the ideas above
and split the load between the two discs as suggested, you gain additional
performance: for example, any program that copies from a user's files to
/tmp and back will involve *three* machines in parallel, with the
potential for significant overlapping.

You also free your expensive, fast 4/280 to act as a diskless compute
server, and you could use the Purdue system that automagically selects the
least loaded machine for executing commands, or statically replace known
piggy (in either memory or time) applications (e.g. troff, nroff, lisp)
with scripts like 'exec rsh sun4 troff "$@"'...
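
Spelled out, such a wrapper could be (a sketch: 'sun4' stands for whatever
the compute server is called, and it assumes the files involved are
visible on both machines via NFS under the same paths):

		#!/bin/sh
		# hypothetical /usr/local/bin/troff on the workstations:
		# run the real troff on the compute server instead
		exec rsh sun4 troff "$@"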

	If you do this, the buffer cache allocations (always assuming
	you are still running SunOS 3) could be:

		# 4 Meg (512 pages) Sun3/50 workstation
		adb -w /vmunix <<@
		bufpages?W	0t90	# 20-25%, cache locally
		nbuf?W 		0t540	# small dirs/files cached
		@

		# 4 Meg (512 pages) Sun3/50 file server
		adb -w /vmunix <<@
		bufpages?W 	0t200	# 50%, what else to use memory for?
		nbuf?W 		0t1000	# probably larger files cached
		@

		# 32 Meg (4096 pages) Sun4/280 compute server
		adb -w /vmunix <<@
		bufpages?W 	0t600	# 15%-20%, CPU/memory bound jobs
		nbuf?W 		0t3000	# probably larger files cached
		@

You should consider adding a local swap+/tmp disc to the compute server
(in your case I would suggest something like 300 MBytes, two thirds for
swap and a third for /tmp -- you don't want to support address spaces more
than, say, 6-7 times your main memory, because then thrashing is virtually
guaranteed), so that it can be used especially efficiently for programs
that require large address spaces or have large intermediate files, and so
that you can save disc space by giving the workstations smaller swap and
private allocations than would otherwise be necessary (for example, a
4 Meg Sun 3/50 might well do with just 8 Megs of swap, depending on which
windowing system you use, etc...).
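
On the compute server this could be as simple as (device names are
placeholders for whatever the local disc really is):

		swapon /dev/sd0b	# roughly 200 MBytes as extra swap
		mount /dev/sd0g /tmp	# roughly 100 MBytes as local /tmp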

Adding a local, cheap, small (say 40 Meg) disc to your workstations for
swap, /tmp, and /private (or /var) can result in considerable reductions
in ethernet traffic and load on the servers. It is useless to have fast
discs on the servers if the path to them is busy, and/or choked at the
ethernet boards.
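
The same trick on a workstation, again with placeholder device names and
an arbitrary split of a 40 MByte disc:

		swapon /dev/sd0b		# say 16 MBytes of local swap
		mount /dev/sd0g /private	# say 16 MBytes for /private (or /var)
		mount /dev/sd0h /tmp		# the rest as local /tmp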

Sharing home directories, executables, and libraries on a remote disc
makes sense for saving space, simplifying dumping, and simplifying
administration. Having multiple workstations instead of terminals can
offload high-overhead user interactions or processes (edits, small
compiles), which nevertheless require good response time, from the
expensive compute servers (on many a PDP or VAX, when vi was introduced or
uucp was running, interrupt overheads killed the CPU, and we don't want to
reproduce that again...).

It probably also does not cost much in performance (it may even help,
since the server discs are fast, as most large discs are, and their cost
per byte is cheaper than that of multiple small discs, as long as the path
to them is not choked).

Both user files and shared utilities can be expected to have a good
locality and/or caching profile:

	users don't often edit or compile hundreds of different large
	files in a session;

	executables can be made sticky and fetched repeatedly from the
	swap partition (see the sketch after this list), because users
	don't often use hundreds of different commands in rapid
	succession in a session;

	all shared material, like libraries etc., is read only.
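
On the sticky bit point, a minimal sketch (the list of binaries is purely
illustrative, and only the super-user may set the bit on plain files):

		# keep the text image of heavily used binaries in the
		# swap area between invocations
		chmod +t /usr/ucb/vi /bin/csh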

On the contrary, swap and temporary or spool files are guaranteed to gain
little from local caching, either because they are not cached at all (swap
isn't), or because they are (most temporaries and spool files) written
about as often as they are read, while NFS is essentially write-through,
on the assumption that reads are much more frequent than writes.

Finally, check all of the above with monitoring software. Using netstat,
iostat and vmstat will give you a lot of insight. Also use tcpdump and
other ethernet traffic analyzers. Summarize accounting data and look at
users' command usage patterns. Know what type of work they are doing, and
nudge them towards using the compute servers etc. if appropriate.
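
A few starting points, all standard tools (the intervals are just
examples):

		netstat -i 5	# per-interface packets, errors, collisions
		iostat 5	# per-disc transfer rates and CPU utilisation
		vmstat 5	# context switches, paging, run queue
		nfsstat		# NFS call mix, retransmissions, bad replies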

In summary: doing a good configuration requires MY PRECIOUS AND UNIQUELY
DEEP ADVICE, as revealed to the gasping masses in this article [;-) ;-)
;-)], and/or knowledge of OS design principles (e.g. implications of the
sticky bit) and performance characterization (e.g. expected cache hit
rates for various types of files), and willingness to do analysis and
monitoring (essential, because guidelines such as mine must be adapted).


Piercarlo "Peter" Grandi
ARPA: pcg%cs.aber.ac.uk at nsfnet-relay.ac.uk
UUCP: ...!mcvax!ukc!aber-cs!pcg
INET: pcg at cs.aber.ac.uk


