Execution time bottleneck: How to speed up execution?

Thu Feb 14 12:58:55 AEST 1991

This is ludicrious.

In article <17664:Feb1319:36:1291 at kramden.acf.nyu.edu> brnstnd at kramden.acf.nyu.edu (Dan Bernstein) writes:
 > In article <4763 at goanna.cs.rmit.oz.au> ok at goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
(A perfectly good solution; so good that I mailed the same to the original
poster.)

What most people fail to see is that the bulk of the work in the loop is the
calculation of the exponential!  So all solutions that fail to reduce that
number of calculations are doomed to fail to reduce the execution time very
much.  (And the solution I saw that used pow rather than exp will not help
either because pow will in general use exp, or an algorithm that is roughly
equivalent.)  Now to Dan's points:
 > Some random observations, starting from this point: It's wasteful to
 > repeatedly multiply a homogeneous formula by a constant. First multiply
 > all the x's by the square root of the constant. Assuming there's space
 > available (or you can trash the original array, or you can divide
 > afterwards and not worry about roundoff error):
Considering the number of multiplications the exponential uses, this is
peanuts (what is it?  I am too lazy to look at the source, but I presume
more than 10.).
 > Next, the loops will run faster backwards on most machines.
Ah yes, going backwards getting a comparison to 0 rather than to a non-zero
value.  Useful, *if* the body of the loop is small.
 > Next, the accumulated a[j] sum might as well be kept in a register, as
 > well as y[j]:
And because the exponential is called this means one more register has
to be saved.  Question: what do we gain?
 > Under some compilers on some machines it will produce a slight speedup
 > to keep a + i and y + i in registers:
And again.
 > On the other hand, the loop should be written somewhat differently for
 > vector machines:
Ha!  And because of the call to exp most C compilers for vector machines
will not vectorize it.  You have to do quite a lot more for those.
(Yup, just two weeks ago I did some optimization on the Cray of a loop
involving logarithms.)
 > If the machine can't vectorize exp then the loop should be split still
 > differently.
Yes, loop splitting.  Do in scalar what can not be vectorized.  So now
you are going to vectorize 1-10% of the work involved.  A waste of programmers
time.

Etc.  I have seen similar other responses.  In general in this kind of cases:
*look where the time is going*.  In this example the time is going to
the exponential function.  If this function is not available in hardware it
will in general take about 10 to 20 times the time needed for a multiplication
(this is an estimate not based on fact, it is probably too low).  Even if
this function (or its computational workhorse) is available in hardware, it
takes a lot of time:
Intel 80387; F2XM1 211-476 clocks (workhorse); Motorola 68881; FETOX 466+
clocks.
So chaving off one or more subtractions, memory accesses etc. will not help
very much.  The basic idea must be to reduce the number of calculations of
exp (which the solution of Richard A. O'Keefe did).
Another thing that is worth considering is reducing the time needed to
calculate the exp.  In that case questions like argument range, required
precision etc. arise.  It might be worthwile to roll your own version of
exp (see for instance Cody&Waite, Software manual for the mathematical
functions; the title is approximate).  Especially if the required precision
is smaller this might help.

But never ever do some micro-optimization that speeds up some part of the
algorithm by a factor of perhaps 100 if it takes only a very small
percentage of the total time needed (see Amdahl's Law); that is really a
waste of programmers time.
--
dik t. winter, cwi, amsterdam, nederland
dik at cwi.nl