FFT's on 4D/2XX systems

Sam Fulcomer sgf at cs.brown.edu
Thu Feb 15 03:56:04 AEST 1990


In article <9002122052.AA21651 at snow-white.merit-tech.com> goss at SNOW-WHITE.MERIT-TECH.COM (Mike Goss) writes:
>In reply to the message from Tom Reed:
>> I'm looking for any FFT software that is available and runs on the
>> 4D/2XX products. The faster the better especially if it is parallel code or
>
>The book "Numerical Recipes in C" (also available in FORTRAN and Pascal
>versions) has several good FFT routines, although not in a parallelized form.

Well, _Numerical_Recipes_ is OK. I haven't tried to parallelize the f77 codes
yet (I haven't poked at them much), but it might be worthwhile.
It's quite possible that PFA won't like them much. Many numerical packages
(IMSL in particular) aren't very adaptable to parallel architectures.

Another problem with all current numerical packages (although NAG is working
on it, as may be others) is that they are not optimized for big-memory
problems on cache machines (i.e., as matrix size goes up, data cache hit rates
go down, and so does performance). Algorithms that process the data in blocks
of nearby addresses are the solution to this problem (although monster data
caches are another).
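
To illustrate the blocking idea, here is a minimal C sketch of a tiled
matrix multiply. It is not taken from any of the packages mentioned; the
function name matmul_blocked and the tile size BLOCK are made up for
illustration, and a sensible tile size depends on the machine's data cache.

```c
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile size; tune to the data cache */

/* Smaller of two sizes, used to clip tiles at the matrix edge. */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/*
 * c += a * b for n x n row-major matrices, walked tile by tile so that
 * each BLOCK x BLOCK piece of a, b and c is reused while it is still
 * likely to be in cache. The caller must zero c to get a plain product.
 */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    size_t ii, jj, kk, i, j, k;

    for (ii = 0; ii < n; ii += BLOCK) {
        size_t imax = min_size(ii + BLOCK, n);
        for (kk = 0; kk < n; kk += BLOCK) {
            size_t kmax = min_size(kk + BLOCK, n);
            for (jj = 0; jj < n; jj += BLOCK) {
                size_t jmax = min_size(jj + BLOCK, n);
                for (i = ii; i < imax; i++) {
                    for (k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (j = jj; j < jmax; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

The arithmetic is the same as the usual triple loop; only the order in which
the elements are visited changes, so that the working set at any moment is
roughly three tiles rather than whole rows and columns.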

The important thing to understand when trying to get performance out of a
multi-proc SGI is to characterize exactly the load it will be under when you
want the performance. Parallelized code will run well (on a 4-proc system) if
it is the only (or nearly only) thing running on the system. If you've got 2
of the beasts running you _may_ still be getting better than single-proc
performance, but don't bet on it. Don't even bother running parallel code if
you don't have (effectively) 2 idle processors.

I haven't bothered using the PFA since we typically have 2 or 3 things going
on at any given time on our 4D/240GTX (64MB) with someone running 4Sight.
My experience with it has been limited to bitching at people who've run
multi-proc jobs on a busy system (and helping them PFA their code).

I am very pleased with the thing's performance on single-proc jobs, though. On
an idle system the machine will run 4 copies of the same computation in the
same wall-clock time that one copy takes. A one-processor job (heavy FPU)
seems to take about 2-3 times as much CPU time as on a 3090 with vector proc
(the program vectorized on the 3090).


