fix to TCP hangs and slow transfers

jbn at wdl1.UUCP
Sat Mar 15 07:55:30 AEST 1986


      There are actually two bugs here.  4.3BSD(beta) has the sequence
number bug as described; I posted a fix for this a few weeks ago and it
follows below.  The very slow connection bug is in a formal sense a bug
in the TCP specification, in that it is possible to implement window management
policies that result in near-stalled transfers.  In 4.3BSD, once the zero 
window state has been reached, notification of more window will not start
transmission unless enough window is available to send one maximum-sized
segment.  If the implementation at the other end sends a window notification
only when the available window goes from zero to nonzero, and then holds off on
further window notification until some data is received, the TCP
connection stalls.  Eventually the zero window probe mechanism gets things
going again, but you typically get only one window's worth of data per
zero window timeout interval, which slows things down to a crawl.  This
is strictly speaking a bug in the receiver end; 4.3BSD is not wrong here.
This one you have to fix at the other end.
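
The stall can be made concrete with a small sketch.  It is a model under
stated assumptions, not the 4.3BSD code: the function names and the byte
counts in the note below are invented for illustration.  It encodes only
the rule stated above, that after a zero window the sender restarts only
when the advertised window can hold one maximum-sized segment.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the sender-side rule described in the text:
 * after the zero-window state, a window update restarts transmission
 * only if it can hold one maximum-sized segment.
 */
static bool sender_resumes(size_t advertised_window, size_t max_segment)
{
    return advertised_window >= max_segment;
}

/*
 * Hypothetical model of the troublesome receiver: it announces a window
 * only on the zero-to-nonzero transition, then stays silent until more
 * data arrives.  Returns true if the connection is left waiting for the
 * zero-window probe timer.
 */
static bool connection_stalls(size_t first_update, size_t max_segment)
{
    /* The receiver sends no further update, so the first one decides. */
    return !sender_resumes(first_update, max_segment);
}
```

With an assumed 1460-byte segment, a first update of 1024 bytes leaves the
connection waiting for the probe timer; an update of 1460 bytes or more
does not.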
       We have observed this bug when talking to our UNET implementation
(which we fixed) and Imagen laser printers (about which we informed Geof
Cooper at Imagen).  A work-around for Imagen users is to reconfigure the
software with a maximum IP datagram size of 576 instead of the default
size based on Ethernet packet size.  This works because it negotiates down
the maximum TCP segment size on the connection, making 4.3BSD act on the
Imagen's window updates of 1K or so.
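
The arithmetic behind the work-around, as a rough sketch.  It assumes the
conventional 40 bytes of IP and TCP headers with no options, a 1500-byte
Ethernet datagram, and a window update of exactly 1024 bytes; the helper
names are made up, and the segment size 4.3BSD actually selects may differ
from these figures.

```c
#include <stdbool.h>

#define IP_TCP_HEADER_BYTES 40   /* assumed: 20-byte IP + 20-byte TCP, no options */

/* Hypothetical helper: segment size implied by a maximum IP datagram size. */
static int segment_size_for(int max_datagram)
{
    return max_datagram - IP_TCP_HEADER_BYTES;
}

/* True if a window update of `window` bytes can hold one full segment. */
static bool update_is_usable(int window, int max_datagram)
{
    return window >= segment_size_for(max_datagram);
}
```

Under these assumptions a 576-byte datagram limit gives a 536-byte segment,
which fits in a 1024-byte update, while a 1500-byte limit gives a 1460-byte
segment, which does not.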
       Incidentally, when studying problems like this, the "trpt(8C)" program,
which formats the headers saved by socket-level debugging, is immensely useful.
The program can also be easily modified to print out any other fields of
interest in the TCP control tables.
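
Socket-level debugging is turned on per socket with the SO_DEBUG option.
A minimal sketch, assuming a BSD-style sockets interface; the wrapper name
enable_socket_debug is invented here, and many systems honor the option
only for privileged processes.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical wrapper: ask the kernel to record debugging information
 * for socket `s`.  Returns 0 on success, -1 on failure (errno is set
 * by setsockopt).
 */
static int enable_socket_debug(int s)
{
    int on = 1;

    return setsockopt(s, SOL_SOCKET, SO_DEBUG, &on, sizeof(on));
}
```

Call it on the socket before connecting, so that the connection setup is
traced as well.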
   
					John Nagle

   ===REPOSTING OF BUG FIXES===

      In response to popular demand, I am sending out two fixes to 4.3BSD
(beta release).  Fix #1 affects interoperability with non-4.xBSD systems,
apparently including TOPS-20 machines.  Fix #2 reduces network congestion
on long-haul nets.  (Yes, yet another of Nagle's continuing attempts to
get network congestion under control.)  The effect of #2 is substantial;
in some situations, an order of magnitude improvement in file transfer
speeds will be observed.
      With these in, 4.3BSD TCP behaves quite well.  In 4.3, all the right
machinery is there, but there are a few easily-fixed bugs.
      These fixes are going out via several routes (net.bugs.4bsd, the
Berkeley buglist, and to some key individuals) because they have a marked
effect on interoperability and Internet performance.

				John Nagle
===============================================================================
Index: sys/netinet/tcp_input.c 4.3BSD-beta Fix

Description:
	TCP connections to some non-BSD systems open, but will not
	accept data from the remote system.  Known problem when
	trying to open connections to TOPS-20 systems.

	The "advertised window", tcp_adv, was not initialized during
	connection synchronization.  Also, one comparison on sequence
	numbers was made incorrectly, using a difference of unsigned 
	values, which in C is never negative(!).

					John Nagle

Repeat-By:
	Try to establish a TCP connection with a system which sets
	the high bit in the TCP sequence number.  (A 4.3BSD system
	which has been up for more than 195 days will do this, or
	you can change the initial value of tcp_iss to some value
	with the high bit set.)
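
	The unsigned-difference problem can be shown in isolation.  This
	is a sketch, not the kernel code: it assumes tcp_seq is a 32-bit
	unsigned type, the function names are invented, and the values in
	the note below are arbitrary.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;   /* assumption: sequence numbers are 32-bit unsigned */

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/*
 * Buggy form: the difference of two unsigned values is itself unsigned,
 * so when rcv_adv is behind rcv_nxt it wraps to a huge value and always
 * wins the MAX.
 */
static uint32_t window_unsigned(int32_t space, tcp_seq rcv_adv, tcp_seq rcv_nxt)
{
    return MAX((uint32_t)space, rcv_adv - rcv_nxt);
}

/*
 * Fixed form: casting the difference to a signed type lets it go
 * negative.  (The conversion relies on two's complement, which holds on
 * essentially every current machine.)
 */
static int32_t window_signed(int32_t space, tcp_seq rcv_adv, tcp_seq rcv_nxt)
{
    return MAX(space, (int32_t)(rcv_adv - rcv_nxt));
}
```

	With 2048 bytes of buffer space, rcv_adv = 1000 and rcv_nxt = 1500,
	the unsigned form yields 4294966796 and the signed form yields 2048.
	The signed form still handles wraparound: rcv_adv = 0x00000100 and
	rcv_nxt = 0xFFFFFF00 differ by 512.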


Fix:
tcp_input.c
327a328,329
> 	 * Be careful with arithmetic here; differences of sequence
> 	 * numbers compare in unexpected ways.  Hence the (int) cast.
329c331
< 	tp->rcv_wnd = MAX(sbspace(&so->so_rcv), tp->rcv_adv - tp->rcv_nxt);
---
> 	tp->rcv_wnd = MAX(sbspace(&so->so_rcv),(int)(tp->rcv_adv-tp->rcv_nxt));
tcp_seq.h:
22a23
>  * Note that our rcv_adv variable needs to be initialized too.
25c26
< 	(tp)->rcv_nxt = (tp)->irs + 1
---
> 	(tp)->rcv_adv = (tp)->rcv_nxt = (tp)->irs + 1
===============================================================================
Index: ucb/netinet/tcp_timer.c 4.3BSD-beta Fix

Description:
	Excessive retransmissions on long-haul nets.  Serious congestion
	in Internet gateways.  File transfer speeds under 10% of expected
	values over 9600 baud point-to-point links.  Angry network
	managers.

	The basic machinery is right but some of the special cases
	are wrong, resulting in bad host behavior on slow links.
	Several problems combine to result in very short retransmit
	intervals:
	1) The smoothed round-trip time is zero until the first
	   successful round-trip without retransmission.  If there
	   is a retransmission of the first packet, the zero value
	   is actually used to compute the round-trip time, resulting
	   in a minimum retransmission time.

	2) The standard backoff algorithm not only backs off rather
	   slowly, but due to an incorrect calculation, the first
	   retransmit interval is 2.0*t_srtt, but the second is only
	   1.0*t_srtt, and not until retransmit #4 or so does the
	   retransmit time get back up to 2*t_srtt.  The supplied
	   "experimental" backoff algorithm backs off at rate 2**n,
	   which reduces retransmits under overload conditions.

					John Nagle

Repeat-By:
	Connect two 4.3BSD systems via a 9600 baud DMR link.  Try a
	big file transfer with ftp(I).  Be prepared for a long wait.
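
	The intended timer calculation, as a user-space sketch rather than
	the kernel code.  Only TCPTV_SRTTRTRAN is taken from the patch; the
	values of PR_SLOWHZ, TCPTV_MIN, TCPTV_MAX and tcp_beta are
	assumptions, and rexmt_ticks, clamp_ticks and MAX_SHIFT are invented
	names.  It shows the 2**n backoff and the larger initial guess used
	when there has been no good round trip yet.

```c
#define PR_SLOWHZ        2                  /* assumed: slow-timer ticks per second */
#define TCPTV_MIN        (1 * PR_SLOWHZ)    /* assumed lower clamp, in ticks */
#define TCPTV_MAX        (30 * PR_SLOWHZ)   /* assumed upper clamp, in ticks */
#define TCPTV_SRTTRTRAN  (10 * PR_SLOWHZ)   /* from the patch: base RTT before 1st good round trip */
#define MAX_SHIFT        12                 /* hypothetical cap so the shift cannot overflow */

static const double tcp_beta = 2.0;         /* assumed: first retransmit is 2.0 * t_srtt */

/* Clamp a tick count into [TCPTV_MIN, TCPTV_MAX]. */
static int clamp_ticks(long v)
{
    if (v < TCPTV_MIN)
        return TCPTV_MIN;
    if (v > TCPTV_MAX)
        return TCPTV_MAX;
    return (int)v;
}

/*
 * Retransmit interval in ticks.  `nrexmt` is the number of retransmits
 * already made (0 for the first one).
 */
static int rexmt_ticks(double t_srtt, int nrexmt)
{
    long base;

    if (t_srtt == 0)                /* no successful round trip yet */
        t_srtt = TCPTV_SRTTRTRAN;
    base = clamp_ticks((long)(tcp_beta * t_srtt));

    if (nrexmt < 0)
        nrexmt = 0;
    if (nrexmt > MAX_SHIFT)
        nrexmt = MAX_SHIFT;
    return clamp_ticks(base << nrexmt);     /* base * 2**nrexmt */
}
```

	With t_srtt = 3 ticks the successive intervals are 6, 12, 24, 48 and
	then the 60-tick ceiling.  With t_srtt still zero the first interval
	is 40 ticks instead of the minimum.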

Fix:
tcp_timer.c
112c112
< int	tcpexprexmtbackoff = 0;
---
> int	tcpexprexmtbackoff = 1;		/* use exponential backoff if 1 */
154a155,169
> 		/*
> 		 * Calculate retransmit timer for non-first try.
> 		 * Start with the same value used for the first retransmit.
> 		 * Then use either the table tcp_backoff to scale this up
> 		 * based on the number of retransmits, or if the patchable
> 		 * flag tcpexprexmtbackoff is set, just multiply it by
> 		 * 2**number of retransmits.
> 		 * If t_srtt is zero when we get here, we have never
> 		 * had a successful round-trip and are already retransmitting,
> 		 * which indicates trouble, so we apply a larger initial guess
> 		 * for the round-trip time.  This prevents serious network 
> 		 * overload when talking to faraway hosts, especially when
> 		 * they aren't answering.
> 		*/
> 		if (tp->t_srtt == 0) tp->t_srtt = TCPTV_SRTTRTRAN;
156c171
< 		    (int)tp->t_srtt, TCPTV_MIN, TCPTV_MAX);
---
> 		    (int)(tcp_beta * tp->t_srtt), TCPTV_MIN, TCPTV_MAX);
tcp_timer.h:
60a61,62
> #define TCPTV_SRTTRTRAN ( 10*PR_SLOWHZ)	/* base roundtrip time if retran
> 						   before 1st good roundtrip */
===============================================================================


