Execution time bottleneck: How to speed up execution?

Fri Feb 15 06:36:22 AEST 1991

In article <21658 at yunexus.YorkU.CA> racine at yunexus.yorku.ca (Jeff Racine) 
writes:

   Execution time query:

   I have the following nested loop in which I must sum (over all i), for each
   x[j], e raised to the power x[i]-x[j] squared (times a constant).

      for (j=1; j <= n; j++) {
	 sum = 0.0;
	 for (i=1; i<= n; i++) {
	    sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
	 }
	 a[j] = k*sum;
      }

Jeff needs to speed it up. The C guys will tell you how to speed up
the loops; being a physicist, ie really just a FORTRAN programmer, I
would just point out to halve the number of them and maybe take out a
multiply. With minimal recoding maybe something along the lines of:
 /*	Just typed in - maybe typos */

      foo = sqrt(con > 0 ? con : -con);
      for (j = 1; j <= n; ++j) {
        a[j] = 1.0;                        /* i = j case */
        y[j] = x[j] * foo;
      }

      if ( con > 0.0 ) 
        for (j = 1; j < n; ++j) 
    	  for (i = j + 1; i <= n; ++i) {
	    a[j] += foo = exp( (y[i]-y[j]) * (y[i]-y[j]) );
	    a[i] += foo;                   /* Two sums per exp */
	  }
      else                     
        for (j = 1; j < n; ++j) 
	  for (i = j + 1; i <= n; ++i) {
	    a[j] += foo = exp( - (y[i]-y[j]) * (y[i]-y[j]) );
	    a[i] += foo; /* NB ^ only difference.*/
	 }

      for (j=1; j <= n; ++j) a[j] *= k;

[pious platitudes about first improving the algorithm, taking
advantage of symmetry, etc. omitted]. I don't know your machine but
it's the halving of the number of exponentials that I would expect to
give you the greatest gain, 50%. If you've got a modern, floating
point orientated RISC machine (RS6000, i860??) floating point
multiplies may be so fast it's not it may not be worth the hassle of
the vector y and the if ( con < 0.0 ) bit. Any FORTRAN compiler would
optimise out the (y[i]-y[j]) * (y[i]-y[j]) in its sleep, so should
your C compiler. If not you could do it by hand as I see someone has
sugested, but that's *probably* even less of a gain. I would guess
that the question of c v assembler hangs upon how long your cpu takes
to do a floating point multiply and exponential against the quality of
your compiler's optimiser.

Halving the number of exponentials is only a factor of two but it's
zero effort and it's always there however you end up coding it.

But I may have missed really something obvious as to why you can't do
that; I had a late night this morning.

Good luck and hope this helps.

John Rowe
Exeter University Computational Physics Group
Exeter
U.K.