68881 Floating Point Speed Increase on Sun 3/60

Sun Sep 24 21:13:48 AEST 1989

>From iis!prl Sun Sep 24 12:13:05 MET 1989 remote from ethz

> One of the professors here at UMR discovered that he gained a 5 times
> performance increase on some of his floating point intensive code with
> complex numbers by using in-line expansion.

This is not a new phenomenon; I have already posted about the slowness
of the SunOS 4.0 maths library (v7, around issues 195-200; the Subject
line is `libm for 68881 and Sun fpa is incredibly slow').

> The problem was with the *humongous* amount of stack traffic that was
> being generated by the 68881 coprocessor calls.

This is *NOT*, repeat *NOT*, the problem. The problem is that
the Sun -lm maths library doesn't use any of the high-level
builtins that are available on the 68881 or on the Sun3 FPA.

The proof of this:

1) Look at the code for sqrt(), cos(), or similar from the library using
adb. You'll see that sqrt() does a coded Newton iteration and doesn't use
the builtin fsqrt operation of the 68881. Similarly, sin() and cos() do a
coded series expansion.

2) If you recode the functions as C-callable functions in assembly
language you get almost the same speedup as using the inline library.
This is partly due to poor coding in the inline library, detailed by David
Hough from Sun in response to my posting.  Unfortunately I don't have an
exact reference to David's article either.
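
To make point 1 concrete, here is a minimal sketch of what a software
Newton iteration for sqrt looks like. It is an illustration only, not the
SunOS libm source: the name newton_sqrt, the starting guess, and the
stopping rule are all assumptions made for this example. The point is that
every call pays for several divides and adds, where the 68881 needs a
single fsqrt instruction.

```c
#include <math.h>

/*
 * Hypothetical illustration, NOT Sun's libm code: square root of a
 * finite x by Newton iteration on f(y) = y*y - x.
 */
double newton_sqrt(double x)
{
    double y, prev;
    int e, i;

    if (x <= 0.0)
        return 0.0;             /* sketch only: no error reporting */

    /* Starting guess: a power of two that is >= sqrt(x). */
    (void) frexp(x, &e);        /* x = m * 2^e with 0.5 <= m < 1 */
    y = ldexp(1.0, (e + 1) / 2);

    for (i = 0; i < 64; i++) {
        prev = y;
        y = 0.5 * (y + x / y);  /* one Newton step */
        if (y >= prev)          /* no longer decreasing: converged */
            break;
    }
    return y;
}
```

Starting from above the root, the iterates decrease towards sqrt(x), so
the loop stops as soon as a step fails to decrease the estimate.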

If you want to experiment with this yourself, try the following:

#include <stdio.h>
#include <math.h>

main()
{
        register int i;
        register double a, b;

        /* Check that at least one result is correct */
        printf("%g\n", sqrt(2.0));

        /* Time 50000 calls to sqrt() */
        for(i = 0, a = 0; i < 50000; i++, a += 0.00001)
                b = sqrt(a);

        exit(0);
}

and another implementation of sqrt:

        .globl _sqrt

_sqrt:
        fsqrtd  sp@(4),fp0      | Do the sqrt
        fmoved  fp0,sp@-        | Push the double result
        movel   sp@+,d0         | Pop high word into d0
        movel   sp@+,d1         | Pop low word into d1
        rts

Command                                         User CPU time
cc sqrttst.c -lm                                19.36 sec
cc sqrttst.c sqrt.s                              1.78 sec
cc sqrttst.c /usr/lib/f68881/libm.il             1.63 sec

All times on a 3/60 running 4.0.3. Be wary of compiling the test routine
with high levels of optimisation. It is not a benchmark which is designed
to be robust in the face of good global optimisation!

> The fix is to designate the inline expansion library for the 68881 on the
> compile line - 
> e.g.
> f77 -c -f68881 code.c -O code /usr/lib/68881/libm.il

This will work much better if you use:
	-O4	(provided iropt doesn't run out of stack space and dump core :-( )
	/usr/lib/f68881/libm.il (the path in the original article has a typo)

> The problem and solution are described in detail in section G
> (Assembly-Level In-line Expansion) of Sun's Floating-Point Programmer's
> Guide, pp. 105-113.

Um, sort of. Sun's Floating-Point Programmer's Guide was written for
3.x, where the implementation of -lm was considerably faster. In 3.x, the
speedup *is* due to reduced stack traffic, but it is nowhere near as
spectacular. In 4.x, the speedup comes from using a completely different
implementation of the functions, and it is considerable.

Depending on the function, using the 68881 (or Sun3 FPA) implementation of
the maths library functions rather than the slow code in -lm will bring a
speedup of between about 3 and 10 times.

For the pedantic: neither my sqrt.s implementation above nor the
implementations in /usr/lib/f68881/libm.il satisfy SVID. This is
documented (somewhere in the FM). It is relatively easy, however, to
implement a sqrt() function which is both more than 10 times faster than
that in -lm and satisfies SVID.

I have had long and wearing discussions with the engineers responsible for
the Sun maths library, and they were never willing to accept the poor
performance of the standard maths library as a bug.

The performance enhancements are now registered as a Request For
Enhancement, RFE#1021706 (SO#310960). If you have a need for a faster
maths library, please contact your Sun support people and quote these
references. This may help to make the improvements happen faster.

For those of you with Sun4s, the FP chip there implements only the sqrt
function, but there is *NO* sqrt in the inline library. Recoding the
sqrt() function for Sun4 in assembly to use the fsqrtd builtin gives you a
factor of 5 speedup.

For the curious, this is how to do a faster sqrt on a Sun4, again, *not*
SVID-conformant:

        .seg    "text"
        .proc   7
        .global _sqrt
_sqrt:
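        save    %sp,-72,%sp     ! New register window and stack frame
        st      %i0,[%fp+68]    ! Spill high word of the double argument
        ld      [%fp+68],%f0    ! ... and reload it into %f0
        st      %i1,[%fp+72]    ! Spill low word of the argument
        ld      [%fp+72],%f1    ! ... and reload it into %f1
        fsqrtd  %f0,%f0         ! Result is returned in %f0/%f1
        ret
        restore                 ! Executed in the delay slot of ret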

BTW, for X11 hackers, the dreaded ARC function can be sped up by a factor
of about 2.5-3 on a Sun3 by either compiling server/ddx/mi/miarc.c with
the inline library (/usr/lib/f68881/libm.il) or by using my sqrt.s hack
above!

Peter Lamb
uucp:  uunet!mcvax!ethz!prl     eunet: prl at iis.ethz.ch    Tel:   +411 256 5241
Integrated Systems Laboratory
ETH-Zentrum, 8092 Zurich