Preliminary Results on Math Library Tests
Booker Bense
benseb at nic.cerf.net
Wed Feb 13 07:55:25 AEST 1991
A while ago I posted some remarks about math libraries and decided
they were rather hasty and that I should probably get some facts. When
I have completed the benchmark I will post the code that was used to
obtain it.
I have obtained some interesting preliminary results and have
some retractions to make. First, as far as I can tell, NAG does not use
BLAS in any form. Examining the loadmaps from the code that ran these
tests reveals that the NAG routines are largely self-contained; the
only calls they make are to error-handling and machine-constant
routines. IMSL uses BLAS Level 1 calls from the system libraries and
has its own version of some BLAS Level 2 routines (SGEMV in this
example). These times are determined by querying the hardware
performance monitor before and after the subroutine call. The test
matrices in this case were the best possible case, i.e.:
cond(A) ~= 1
A(i,i) > A(i,j), i != j.
Each routine returned results accurate to machine precision.
More difficult cases will be included in the final version.
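A best-case matrix of this kind is easy to generate: a dominant diagonal over small off-diagonal noise keeps cond(A) near 1. A sketch in NumPy (my construction, not the one used in the benchmark harness), checking that a plain LU-based solve recovers the answer to machine precision:

```python
import numpy as np

def best_case_matrix(n, seed=0):
    """Best-case test matrix: diagonal of 1s dominating small random
    off-diagonal entries, so cond(A) ~= 1 and A[i,i] > A[i,j], i != j."""
    rng = np.random.default_rng(seed)
    A = (0.1 / n) * rng.uniform(0.0, 1.0, size=(n, n))
    np.fill_diagonal(A, 1.0)
    return A

n = 203
A = best_case_matrix(n)
x_true = np.ones(n)
x = np.linalg.solve(A, A @ x_true)   # LU factor + solve, like SGEFA/SGESL
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(np.linalg.cond(A) < 1.5, rel_err < 1e-12)
```

With the 0.1/n scaling the off-diagonal perturbation has norm well below 1, so the condition number stays close to 1 and the relative error sits near machine epsilon.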
SGEFA   - CRI libsci optimized version of the LINPACK routines
FO1BTF  - NAG Mark 13 (references an algorithm by Du Croz, Nugent, Reid & Taylor)
LFTRG   - IMSL math version 10.0 (uses the LINPACK algorithm)
GENERIC - Fortran LINPACK compiled with vector optimization on.
All rates are in Mflops (millions of floating-point operations per second). A = A(size,size)
Size 101 203 407 815
SGEFA 99.955 131.174 148.675 158.382
FO1BTF 77.289 105.933 131.063 146.328
LFTRG 72.544 156.559 218.848 257.777
The next set of results is from forcing IMSL to use the libsci
version of SGEMV.
Size 101 203 407 815
SGEFA 97.777 130.377 149.025 157.939
FO1BTF 72.429 108.292 132.440 147.396
LFTRG 105.384 213.625 255.089 289.730
This result is from a run using generic fortran BLAS and Linpack routines.
Size 101 203 407 815
GENERIC 35.94 64.359 96.345 136.265
- The Mflop rates are all from runs on 1 CPU of an 8-CPU YMP in
multi-user mode (UNICOS 5.1), i.e. around 0% idle time. I would say that
the results have a repeatability of around 5%, with results for the
small sizes being more repeatable. Due to the way the YMP memory is
organized, memory fetches are a function of system load, and the larger
problems are more affected by this.
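The rates above were read from the hardware performance monitor, but the same figure can be reconstructed from wall-clock time using the standard LINPACK operation count for an n x n factorization, ~2n^3/3 flops. A minimal sketch (NumPy's LU-based solver as a stand-in for SGEFA; the routine and timing method are my assumptions, not the original harness):

```python
import time
import numpy as np

def lu_mflop_rate(n, trials=3, seed=0):
    """Estimate Mflops for an n x n LU solve, crediting it with the
    classic LINPACK count of 2*n**3/3 flops (the dominant term; the
    triangular solves add only O(n**2))."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        np.linalg.solve(A, b)          # LU factorization + solve
        best = min(best, time.perf_counter() - t0)
    return (2.0 * n**3 / 3.0) / best / 1.0e6

print(lu_mflop_rate(407))
```

Taking the best of several trials is one way to damp the system-load noise described above; the monitor-based numbers in the post sidestep that by counting operations directly.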
-Conclusions:
1. It pays to read the loadmap; the only difference between run 1 and
run 2 was in the load command:
1: segldr -limslmath,nag *.o
2: segldr -lsci,imslmath,nag *.o
2. These are only best-case results. I wanted to find out the
fastest possible speed for these routines. The routines in question
are the simplest possible; in a real problem you would probably want
to use the more sophisticated versions and do some checking on the
condition number before you believe the results.
3. IMSL is a lot faster than I would have expected; I thought the
speeds for SGEFA would be consistently faster than either IMSL or
NAG. 290 Mflops is as fast as any code I've run on a single processor;
330 is the speed you're guaranteed never to exceed. The algorithm
quoted in the NAG reference manual is one designed for paging
machines; I don't know how much they massaged it for the YMP. All of
these numbers do reflect some effort at machine optimization (compare
with GENERIC).
4. Subroutine calls are expensive; the large difference between the
generic version and the libsci version can in part be explained by the
increased number of subroutine calls. The libsci versions of both SGEMV
and SGEFA have had almost all of their subroutine calls inlined. As
the size of the problem becomes larger, the generic version approaches
the optimized version, because the subroutine-call overhead grows much
more slowly than the number of required flops, which is cubic in the
problem size. This also explains the large difference between IMSL with
and without the libsci SGEMV for small problems.
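Conclusion 4 can be put in a toy model (the per-call cost and call count below are assumptions for illustration, not measurements): a Level-1-BLAS SGEFA issues roughly n*(n-1)/2 saxpy calls, so a fixed per-call overhead is amortized over ~2n^3/3 useful flops and its share of the run time shrinks as n grows:

```python
def call_overhead_fraction(n, cycles_per_call=50.0, flops_per_cycle=2.0):
    """Toy model: fraction of total time spent on subroutine-call
    overhead in a Level-1-BLAS LU factorization.  Assumes ~n*(n-1)/2
    saxpy calls, a fixed cycle cost per call, and the LINPACK count
    of 2*n**3/3 useful flops."""
    overhead_cycles = (n * (n - 1) / 2.0) * cycles_per_call
    useful_cycles = (2.0 * n**3 / 3.0) / flops_per_cycle
    return overhead_cycles / (overhead_cycles + useful_cycles)

# The overhead fraction falls as the problem grows, which is why the
# generic version closes in on the inlined libsci version at large n.
for n in (101, 203, 407, 815):
    print(n, call_overhead_fraction(n))
```

The same shrinking ratio explains why the libsci SGEMV helps IMSL most at the small sizes, where per-call overhead is still a large slice of the total.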
- Booker C. Bense
/* benseb at grumpy.sdsc.edu */