Summary on IEEE error handler for SUN FORTRAN (L

Fri Jan 25 06:50:46 AEST 1991

This is a summary of the responses to the question below, which I posted
to the net yesterday. I received solutions and suggestions from the
following netters:

khb at Eng.Sun.COM 
borcherb at turing.cs.rpi.edu 
mckie at sky.arc.nasa.gov
larry at pylos.cchem.berkeley.edu 
carlo at nu.uchicago.edu 

As this seems to be a rather common problem among number crunchers, I
summarize the answers in the rest of this message.

Thanks to all of you who kindly responded to my question. I appreciate it.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%    Yun Fei Zhang             %    E-mail:                                %
%    Astronomy Department      %       SPAN: east::"zhang at buast0.bu.edu"   %
%    Boston University         %     BITNET: zhang at buasta                  %
%    725 Commonwealth Ave.     %     INTnet: zhang at buast0.bu.edu           %
%    Boston, MA 02215          %             zhang at bu-ast.bu.edu           %
%--------------------------------------------------------------------------%
%                          TEL: (617)-353-8917                             %
%                        TELEX: 95-1289 BOS UNIV BSN                       %
%                      TELEFAX: (617)-353-5704                             %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-----------------------------------------------------------------------------
ORIGINAL QUESTION POSTED:

I have a question about the error handler of SUN FORTRAN: how can I find
the location in a program where an arithmetic error occurs? This would be
especially helpful as I am writing a computation-intensive code. On
VAX/VMS machines, the code crashes when it encounters such an arithmetic
error and tells the user where it occurred. On SUNs, however, it only
prints a message at the end of the job, something like the following:

>  Warning: the following IEEE floating-point arithmetic exceptions 
>  occurred in this program and were never cleared: 
>  Inexact;  Division by Zero;  Invalid Operand; .....

My question is how to determine the point(s) in the code where the
indicated arithmetic exceptions happened. Modifying the IEEE error
handler (e.g. sigfpe_ieee, etc.) seems a possible approach, but it
involves changes to these lower-level routines, which I am reluctant to
try. Is there any other option for achieving the same goal (e.g.,
compilation/linking options or software tools)?

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

SOLUTIONS:

khb at Eng.Sun.COM pointed in the right direction for a solution in the
first response received:

From f77 code, as mentioned in the Numerical Computation Guide and
the Fortran User's Guide,

	i = ieee_handler("set","common",%val(2)) ! aka SIGFPE_ABORT
				                 ! if you use the .h
						 ! file with mathincludes

will cause execution to stop on divide by zero, operations on NaNs etc. If
you want to catch inexact, you can do that too (ask for it by name). 

####
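
A minimal sketch of a complete test program built around the call above,
for trying it in isolation. It assumes Sun f77 and its ieee_handler
library function; the program name "trapit" and the variable names are
made up for illustration, and the check on the return value assumes that
nonzero means the request was refused (see man ieee_handler on your
system).

```fortran
      program trapit
c     Hypothetical test program; requires Sun f77 and its
c     ieee_handler library function.
      integer ieee_handler, ieeer
      real x, y
c     %val(2) is SIGFPE_ABORT, as in the reply above.
      ieeer = ieee_handler('set', 'common', %val(2))
c     Assumption: a nonzero return means the request was refused.
      if (ieeer .ne. 0) write(*,*) 'could not set IEEE handler'
      x = 0.0
c     Division by zero: with the handler set, the program should
c     stop here instead of carrying on with Inf.
      y = 1.0 / x
      write(*,*) 'y = ', y
      end
```

Compiling with -g and without optimization keeps the division from being
folded away at compile time.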

borcherb at turing.cs.rpi.edu pointed out the following:

Try man f77_ieee_environment for documentation on how to do this.
Unfortunately, I believe that on SUN-4's, the handler doesn't actually
report the address at which the exception happened.  I'm told that this is
because of the pipelined nature of the SPARC processor.  At any rate, I
was unable to get this working on a SUN-4.  However, my code does stop the
program as soon as the SIGFPE signal is sent to it.

####

% The most pragmatic approach, I think, comes from
% mckie at sky.arc.nasa.gov, who wrote:

On our Sun, there were so many people who had the same question as you
about how to find ieee errors in a fortran program that I set up a man
page to try to explain it.  I'll include a part of that man page below.
It seems a bit strange in comparison to your vms experiences, but after
you've done it a few times, it's not too bad.  And the ieee approach is
more flexible & more under your control.

-Bill McKie
NASA Ames Research Center
mckie at sky.arc.nasa.gov

 =============================================================

SYNOPSIS

The following is an abbreviated description of how to use the Sun DBX
debugger to find where floating point errors are occurring in a fortran
program.

DESCRIPTION

     Step 1.

     Add the following statements to the program's main module:

         external handler

         call ieee_handler('set','common',handler)

     The "external" statement is a declaration, and should appear
     in the preliminary non-executable statements section of the
     main program source code.  The call to "ieee_handler" should
     be placed into the main program as one of the first
     executable statements.

     Step 2.

     Add the following "stub"  subroutine  to  the  main  program
     source code:

         subroutine handler(i1,i2,i3,i4)
         end

     The subroutine name "handler" is arbitrary, but must be  the
     same  in  the  subroutine statement, the external statement,
     and the call ieee_handler statement's 3rd argument.

     Step 3.

     Compile the program as usual, but everywhere the f77 command
     is used, include the "-g" option in the f77 command line.

     E.g. if the program is entirely in the file prog.f (including
     the handler subroutine), then the following could be used:

         f77 -g -o prog prog.f

     Step 4.

     The program is now ready to run, and it could simply be run
     as usual using the "prog" command.  However, to find the
     place where a floating point error is occurring, the dbx
     debugger utility is used to control execution of the
     program.  This is how it is run:

         dbx
         dbx> debug prog
         dbx> catch FPE
         dbx> run
         signal FPE in <routine name> at line n in file <file_name>
         dbx> quit

     The "dbx>" are prompts from the debugger.  The line following
     the "run" command is output by the debugger, and is a clue
     as to where the error occurred.

     Step 5.

     Edit the file <file_name> and move to line n  to  see  where
     the error was occurring.

SEE ALSO
     dbx(1)  dbxtool(1)  f77(1)

LIMITATIONS
     The above description demonstrates only a  small  subset  of
     the  dbx debugger's capabilities.  See the dbx user's manual
     for more information on what dbx can do.

####

% larry at pylos.cchem.berkeley.edu contributed another approach:

I'm not sure that I understand what you are asking, but here is the code
that I use to make the default behavior similar to what you describe - die
on divide by zero, etc. It requires the use of the C-preprocessor on a Sun
to make it a 'compile time option'. 

#if ERROR && SUN
#include <f77/f77_floatingpoint.h>

        external error_handler
        integer ieeer,ieee_handler,error_handler
#endif

cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

c set up the error handler to barf on all exceptions and then clear
c the inexact exceptions which occur with any floating point operation
c these should be the first executable lines in your code.

#if SUN && ERROR

        ieeer=ieee_handler('set','all',error_handler)
        ieeer=ieee_handler('clear','inexact',error_handler)

#endif

c a separate function

#if SUN && ERROR

c error handler that is called by the IEEE package on the sun

        integer function error_handler(sig,code,sigcontext)
        integer sig,code,sigcontext(5)
        character label*16

c default label in case none of the codes below match
        label='unknown'
        if (loc(code).eq.212) label='overflow'
        if (loc(code).eq.208) label='invalid'
        if (loc(code).eq.204) label='underflow'
        if (loc(code).eq.200) label='division'
        if (loc(code).eq.196) label='inexact'

        write(*,*) 'IEEE exception code ',loc(code),
     2  ' ( ',label(1:lnblnk(label)),' ) occurred at pc ',sigcontext(4)

c any error processing can be done here. I just choose to kill the
c program gracefully

        call abort(' IEEE exception code - Program Halted')

        stop
        end
#endif

########

% carlo at nu.uchicago.edu showed me a similar trick, which can trace the
% chain of calling routines down to the point of failure:

Hi.  The following method is the one that I have had to resort to.  It's a
bit of a kludge, but at least it allows identification of the routine
within which the exception occurred, and the problem can usually be
identified using dbx:

      program foobawooba

      external handler

      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
      data nlevel/1/stack/'main',19*''/

      ieeer=ieee_handler('set','common',handler)

c 'common' handles invalid, overflow, and division exceptions --- see
c "man ieee_handler".  handler is an external routine shown at the end
c of this example.  This line will call handler whenever a 'common'
c exception occurs.

[     some code here]
      call haha1
[      more code]
      stop
      end

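      subroutine haha1
      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
c push this routine's name on entry, as in haha2 below
      nlevel=nlevel+1
      stack(nlevel)='haha1'
      .
      call haha2
      .
      stack(nlevel)=''
      nlevel=nlevel-1
      return
      end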

      subroutine haha2
      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
      nlevel=nlevel+1
      stack(nlevel)='haha2'
      .
      stack(nlevel)=''
      nlevel=nlevel-1
      return
      end

      integer function handler ( sig, code, sigcontext )
      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
      integer sig
      integer code
      integer sigcontext(5)

      write(6,*) 'Bomb!  Here comes a stack dump:'
      do 1 i=1,nlevel
        write(6,*) stack(i)
1     continue
      write(6,*) 'Number of levels:',nlevel
      call abort
      end

The effect of all these (admittedly ugly and machine specific) gymnastics
is that the routine in which the exception occurred is pinpointed by the
array 'stack' and the variable 'nlevel'.  Since execution is halted by
means of 'call abort', all the debugging information is still available,
and the problem may be identified (if the debugger was active) by
examining the guilty routine.  The ass paining part of all this is that
the lines affecting 'stack' and 'nlevel' must be included in every routine
in the program.  I'm not that happy with it, but it's the best I've been
able to do.  One might wish that it would dawn on  the *%&#!!!? C
programmers who developed Fortran for the Sun that for the purposes of
scientific programming, failure to *automatically* halt on division by
zero is a bug, not a feature :-(> .

I hope this helps.

Carlo Graziani

#####
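
The per-routine bookkeeping in Carlo's method can be reduced to two calls
by moving it into small helper routines. The following is only a sketch
in plain Fortran 77: the names dbpush, dbpop, dbdump, work and demo are
made up for illustration and are not part of Carlo's code or of any Sun
library. It keeps his common blocks /debug1/ and /debug2/ and adds a
guard so that nesting deeper than the 20 available slots does not
overrun the array.

```fortran
      program demo
c     Hypothetical driver showing the helper routines in use.
      common /debug1/ nlevel
      common /debug2/ stack(20)
      character stack*25
      nlevel = 0
      call dbpush('main')
      call work
      call dbdump
      call dbpop
      end

      subroutine work
c     Every routine needs only the dbpush/dbpop pair.
      call dbpush('work')
      call dbdump
      call dbpop
      return
      end

      subroutine dbpush(name)
c     Record entry into routine "name"; levels beyond the 20
c     available slots are counted but not stored.
      character name*(*)
      common /debug1/ nlevel
      common /debug2/ stack(20)
      character stack*25
      nlevel = nlevel + 1
      if (nlevel .le. 20) stack(nlevel) = name
      return
      end

      subroutine dbpop
c     Record exit from the current routine.
      common /debug1/ nlevel
      common /debug2/ stack(20)
      character stack*25
      if (nlevel .le. 0) return
      if (nlevel .le. 20) stack(nlevel) = ' '
      nlevel = nlevel - 1
      return
      end

      subroutine dbdump
c     Print the recorded call chain, outermost routine first.
      common /debug1/ nlevel
      common /debug2/ stack(20)
      character stack*25
      do 10 i = 1, min(nlevel, 20)
         write(6,*) stack(i)
 10   continue
      write(6,*) 'Number of levels:', nlevel
      return
      end
```

With these in place, a handler like Carlo's could simply call dbdump
before call abort, and each routine in the program needs only a dbpush
on entry and a dbpop before every return.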