Preliminary Results on Math Library Tests
Booker Bense
benseb at nic.cerf.net
Wed Feb 13 07:55:25 AEST 1991
A while ago I posted some remarks about math libraries and decided
they were rather hasty and that I should probably get some facts. When
I have completed the benchmark I will post the code that was used to
obtain it.
I have obtained some interesting preliminary results and have
some retractions to make. First, as far as I can tell, NAG does not use
BLAS in any form. Examining the loadmaps from the code that ran these
tests reveals that the NAG routines are largely self-contained; the
only calls they make are to error-handling and machine-constant
routines. IMSL uses BLAS Level 1 calls from the system libraries and
has its own version of some BLAS Level 2 routines (SGEMV in this
example). These times are determined by querying the hardware
performance monitor before and after the subroutine call. The test
matrices in this case were the best possible case, i.e.:
cond(A) ~= 1
A(i,i) > A(i,j), i != j.
Each routine returned results accurate to machine precision.
More difficult cases will be included in the final version.
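A best-case matrix of this kind is easy to generate: a dominant diagonal over small off-diagonal noise keeps cond(A) near 1. A sketch in NumPy (my construction, not the one used in the benchmark harness), checking that a plain LU-based solve recovers the answer to machine precision:

```python
import numpy as np

def best_case_matrix(n, seed=0):
    """Best-case test matrix: diagonal of 1s dominating small random
    off-diagonal entries, so cond(A) ~= 1 and A[i,i] > A[i,j], i != j."""
    rng = np.random.default_rng(seed)
    A = (0.1 / n) * rng.uniform(0.0, 1.0, size=(n, n))
    np.fill_diagonal(A, 1.0)
    return A

n = 203
A = best_case_matrix(n)
x_true = np.ones(n)
x = np.linalg.solve(A, A @ x_true)   # LU factor + solve, like SGEFA/SGESL
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(np.linalg.cond(A) < 1.5, rel_err < 1e-12)
```

With the 0.1/n scaling the off-diagonal perturbation has norm well below 1, so the condition number stays close to 1 and the relative error sits near machine epsilon.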
SGEFA   - CRI libsci optimized version of the LINPACK routines
FO1BTF  - NAG Mark 13 (references an algorithm by Du Croz, Nugent, Reid & Taylor)
LFTRG   - IMSL math version 10.0 (uses the LINPACK algorithm)
GENERIC - Fortran LINPACK compiled with vector optimization on.
All rates are in Mflops (millions of floating-point operations per second). A = A(size,size)
Size 101 203 407 815
SGEFA 99.955 131.174 148.675 158.382
FO1BTF 77.289 105.933 131.063 146.328
LFTRG 72.544 156.559 218.848 257.777
The next set of results is from forcing IMSL to use the libsci
version of SGEMV.
Size 101 203 407 815
SGEFA 97.777 130.377 149.025 157.939
FO1BTF 72.429 108.292 132.440 147.396
LFTRG 105.384 213.625 255.089 289.730
This result is from a run using generic fortran BLAS and Linpack routines.
Size 101 203 407 815
GENERIC 35.94 64.359 96.345 136.265
- The Mflop rates are all from runs on 1 CPU of an 8-CPU YMP in
multi-user mode (UNICOS 5.1), i.e. around 0% idle time. I would say that
the results have a repeatability of around 5%, with results for the
small sizes being more repeatable. Due to the way the YMP memory is
organized, memory fetches are a function of system load, and the larger
problems are more affected by this.
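The rates above were read from the hardware performance monitor, but the same figure can be reconstructed from wall-clock time using the standard LINPACK operation count for an n x n factorization, ~2n^3/3 flops. A minimal sketch (NumPy's LU-based solver as a stand-in for SGEFA; the routine and timing method are my assumptions, not the original harness):

```python
import time
import numpy as np

def lu_mflop_rate(n, trials=3, seed=0):
    """Estimate Mflops for an n x n LU solve, crediting it with the
    classic LINPACK count of 2*n**3/3 flops (the dominant term; the
    triangular solves add only O(n**2))."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        np.linalg.solve(A, b)          # LU factorization + solve
        best = min(best, time.perf_counter() - t0)
    return (2.0 * n**3 / 3.0) / best / 1.0e6

print(lu_mflop_rate(407))
```

Taking the best of several trials is one way to damp the system-load noise described above; the monitor-based numbers in the post sidestep that by counting operations directly.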
-Conclusions:
1. It pays to read the loadmap; the only difference between run 1 and
run 2 was in the load command:
1: segldr -limslmath,nag *.o
2: segldr -lsci,imslmath,nag *.o
2. These are only best-case results. I wanted to find out the
fastest possible speed for these routines. The routines in question
are the simplest possible; in a real problem you would probably want
to use the more sophisticated versions and do some checking on the
condition number before you believe the results.
3. IMSL is a lot faster than I would have expected; I thought the
speeds for SGEFA would be consistently faster than either IMSL or
NAG. 290 Mflops is as fast as any code I've run on a single processor;
330 is the speed you're guaranteed never to exceed. The algorithm
quoted in the NAG reference manual is one designed for paging
machines; I don't know how much they massaged it for the YMP. All of
these numbers do reflect some effort at machine optimization (compare
with GENERIC).
4. Subroutine calls are expensive; the large difference between the
generic version and the libsci version can in part be explained by the
increased number of subroutine calls. The libsci versions of both SGEMV
and SGEFA have had almost all of their subroutine calls inlined. As
the size of the problem becomes larger, the generic version approaches
the optimized version, because the subroutine-call overhead grows much
more slowly than the number of required flops, which is cubic in the
problem size. This also explains the large difference between IMSL with
and without the libsci SGEMV for small problems.
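Conclusion 4 can be put in a toy model (the per-call cost and call count below are assumptions for illustration, not measurements): a Level-1-BLAS SGEFA issues roughly n*(n-1)/2 saxpy calls, so a fixed per-call overhead is amortized over ~2n^3/3 useful flops and its share of the run time shrinks as n grows:

```python
def call_overhead_fraction(n, cycles_per_call=50.0, flops_per_cycle=2.0):
    """Toy model: fraction of total time spent on subroutine-call
    overhead in a Level-1-BLAS LU factorization.  Assumes ~n*(n-1)/2
    saxpy calls, a fixed cycle cost per call, and the LINPACK count
    of 2*n**3/3 useful flops."""
    overhead_cycles = (n * (n - 1) / 2.0) * cycles_per_call
    useful_cycles = (2.0 * n**3 / 3.0) / flops_per_cycle
    return overhead_cycles / (overhead_cycles + useful_cycles)

# The overhead fraction falls as the problem grows, which is why the
# generic version closes in on the inlined libsci version at large n.
for n in (101, 203, 407, 815):
    print(n, call_overhead_fraction(n))
```

The same shrinking ratio explains why the libsci SGEMV helps IMSL most at the small sizes, where per-call overhead is still a large slice of the total.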
- Booker C. Bense
/* benseb at grumpy.sdsc.edu */