Fortran optimization - THE ANSWER!

Fri Apr 5 16:08:03 AEST 1991

fsjohnv at alliant1.lerc.nasa.gov (Richard Mulac) writes:

>The contradiction that I found in my
>real code can be described using the following program TEST, which calls
>subroutines JOINED and SPLIT.
>
>      PROGRAM TEST
>      PARAMETER (IX=10000,JX=100,NX=100)
>      COMMON /GLOBAL/
>     .  A(IX,JX),B(IX,JX),C(IX,JX)
>      CALL JOINED
>      CALL SPLIT
>      STOP
>      END
>      SUBROUTINE JOINED
>      PARAMETER (IX=10000,JX=100,NX=100)
>      COMMON /GLOBAL/
>     .  A(IX,JX),B(IX,JX),C(IX,JX)
>      DO N=1,NX
>      DO I=1,IX
>      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
>      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
>               :
>      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
>      A(I,100) = A(I,100) * B(I,100) + C(I,100)
>      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
>      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
>               :
>      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
>      A(I,100) = A(I,100) * B(I,100) + C(I,100)
>      ENDDO
>      ENDDO
>      RETURN
>      END
>      SUBROUTINE SPLIT
>      PARAMETER (IX=10000,JX=100,NX=100)
>      COMMON /GLOBAL/
>     .  A(IX,JX),B(IX,JX),C(IX,JX)
>      DO N=1,NX
>      DO I=1,IX
>      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
>      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
>               :
>      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
>      A(I,100) = A(I,100) * B(I,100) + C(I,100)
>      ENDDO
>      DO I=1,IX
>      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
>      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
>               :
>      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
>      A(I,100) = A(I,100) * B(I,100) + C(I,100)
>      ENDDO
>      ENDDO
>      RETURN
>      END
>
>   The code was compiled with the -o aggress option which vectorized the
>inner loops in both JOINED and SPLIT.  It was expected that JOINED would run
>faster and more efficiently than SPLIT, but the opposite result occurred.
>Using the hardware performance monitor (hpm) I found that JOINED ran in 2.8
>seconds at a rate of 140 Mflops, while SPLIT ran in 2.2 seconds at a rate
>of 180 Mflops.  The results from this test program were similiar to those
>I found when combining 10 to 50 line loops with complicated right hand sides
>in the original code I was working on.
>   Several people analyzed the program and found the problem to be due to the
>large amount of address locations which are computed and saved for later use
>within each loop (i.e., JOINED).

Mulac concludes that the code is poorly written and there's is no compiler bug.
I agree that the code is poorly written, but I don't believe the cause
is addressing expressions.

Since A, B, and C all have the same dimensions,
we should be able to handle all the addressing in the inner loops
with a single index register and constant offsets.  All the addressing
expressions should be of the form

	I*8 + A + 0*100		I*8 + B + 0*100		I*8 + C + 0*100
	I*8 + A + 1*100		I*8 + B + 1*100		I*8 + C + 1*100
	...			...			...

Since I is always multiplied by 8, it should be strength reduced.
The rest of each addressing expression is constant.
So, index + constant.

If the code is indeed suffering from too many addressing expressions,
then Cray should look to their strength reduction and consider
reassociation of loop-invariants.

There are alternative explanations.

You're addressing A, B, and C with the 2nd index varying
(running from 1 to 100).  On a cache machine, this is a bad move.
I dunno about Crays.  Certainly you're causing non-unit stride accesses.

Further, if the compiler detects that A(I, 1) is unchanged
across the 99 intervening assignments, it may be trying to
hold 100 FP values in registers, optimistically awaiting
their reuse.

I would propose an alternative implementation

	DO N=1, NX
	  DO J = 1, JX
	    DO I=1, IX
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	    ENDDO
	  ENDDO
	ENDDO

We would hope the compiler will achieve the code below

	DO N=1, NX
	  DO J = 1, JX
	    DO I=1, IX
	      aij = A(I, J)
	      bij = B(I, J)
	      cij = C(I, J)
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      A(I, J) = aij
	    ENDDO
	  ENDDO
	ENDDO

If you must unroll, unroll the outermost loop, giving

	DO N=1, NX, 4
	  DO J = 1, JX
	    DO I=1, IX
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	    ENDDO
	  ENDDO
	ENDDO

which the compiler will probably turn into

	DO N=1, NX, 4
	  DO J = 1, JX
	    DO I=1, IX
	      aij = A(I, J)
	      bij = B(I, J)
	      cij = C(I, J)
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      A(I, J) = aij
	    ENDDO
	  ENDDO
	ENDDO

Be aware though, that unrolling the outer loop and jamming the bodies isn't
always legal or the right thing to do.

Preston Briggs