Need info on IEEE quad format

Prescott K. Turner turner at sdti.UUCP
Tue Sep 20 02:11:36 AEST 1988


In article <660016 at hpclscu.HP.COM>, shankar at hpclscu.HP.COM (Shankar Unni)
writes:
>I need some info on IEEE floating point representation limits to construct
>a <float.h> file for ANSI C. I already have the info for single and double
>floats:
>...
Here are some improvements to your single and double values:
     #define	FLT_ROUNDS	1 /* is consistent with tie-breaking */
                                  /* to nearest even significand */
    
     #define	DBL_MIN_EXP	-1021 /* Because C uses a different */
     #define	DBL_MAX_EXP	1024  /* (inferior) model for floating */ 
     #define	FLT_MIN_EXP	-125  /* point numbers from IEEE 754, its */
     #define	FLT_MAX_EXP	128   /* MIN_EXP and MAX_EXP values are */
                                      /* different. */

     #define	DBL_MIN		2.2250738585072014e-308 /* more accurate */
     #define	DBL_MAX		1.7976931348623157e+308 /* than the latest */
                                                        /* draft C standard */
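
To illustrate the exponent comment: C's model takes the significand in
[0.5, 1), whereas IEEE 754 takes it in [1, 2), so C's exponents come out one
larger than the IEEE ones.  frexp() normalizes the C way, so it should
recover the <float.h> exponents directly.  A small check, assuming an IEEE
double and a working frexp() (the function names are made up for this
sketch):

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Exponent of x in the C model, where the significand lies in [0.5, 1). */
int c_model_exponent(double x)
{
    int e;
    (void)frexp(x, &e);
    return e;
}

/* Assumes IEEE 754 double; on such a system both assertions should hold. */
void check_exponent_model(void)
{
    assert(c_model_exponent(DBL_MIN) == DBL_MIN_EXP); /* IEEE -1022, C -1021 */
    assert(c_model_exponent(DBL_MAX) == DBL_MAX_EXP); /* IEEE 1023,  C 1024  */
}
```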

>The information I need is (for quad-precision (128-bit) floats):
The draft C standard does not provide the figures for IEEE quad-precision
because the IEEE 754 standard prescribes only lower limits for range and
precision of a 'double extended' format.  I will attempt to fill in your
table, based on the quad format which appeared in an early draft of the IEEE
standard, and which is supported by Intel coprocessors. 

     #define	LDBL_MANT_DIG	112  /* no hidden bit */
     #define	LDBL_EPSILON	3.851859888774471706111955885169855E-34L
     #define	LDBL_DIG	33
     #define	LDBL_MIN_EXP	-16381
     #define	LDBL_MIN        3.362103143112093506262677817321753E-4932L
     #define	LDBL_MIN_10_EXP -4931
     #define	LDBL_MAX_EXP	16384
     #define	LDBL_MAX	1.1897314953572317650857593266280069E+4932L
     #define	LDBL_MAX_10_EXP 4932
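
The decimal figures in this table follow from the three binary parameters,
so they can be cross-checked with a little arithmetic.  A rough sketch,
assuming radix 2 and the formulas for _DIG, _MIN_10_EXP and _MAX_10_EXP as I
understand them from the draft C standard; the struct and function names are
invented for the illustration:

```c
#include <assert.h>

#define LOG10_2 0.30102999566398119521  /* log10(2) */

struct decimal_limits {
    int dig;         /* *_DIG        */
    int min_10_exp;  /* *_MIN_10_EXP */
    int max_10_exp;  /* *_MAX_10_EXP */
};

/*
 * Derive the decimal limits of a radix-2 format from its mantissa
 * width (p), min_exp and max_exp.  The casts truncate toward zero,
 * which acts as floor() for the positive products and as ceil() for
 * the negative one.
 */
struct decimal_limits derive_limits(int mant_dig, int min_exp, int max_exp)
{
    struct decimal_limits d;
    d.dig        = (int)((mant_dig - 1) * LOG10_2); /* floor((p-1) log10 2)   */
    d.min_10_exp = (int)((min_exp - 1) * LOG10_2);  /* ceil(log10 2^(emin-1)) */
    d.max_10_exp = (int)(max_exp * LOG10_2);        /* floor(log10 of a value
                                                       just under 2^emax)     */
    return d;
}

/* Cross-check against the quad figures quoted above. */
void check_quad_limits(void)
{
    struct decimal_limits q = derive_limits(112, -16381, 16384);
    assert(q.dig == 33 && q.min_10_exp == -4931 && q.max_10_exp == 4932);
}
```

Feeding in 53, -1021, 1024 should likewise reproduce the familiar double
values 15, -307 and 308.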

>The magnitude of the smallest de-normalized quad would also be useful...
     #define    LDBL_DENORM_MIN 1E-4965L
No need for lots of digits here, because the smallest denormalized number has
only 1 bit of precision.
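
The same reasoning can be tried on double, where the answer is easy to
verify: the smallest denormal is a single bit in the last place,
2^(DBL_MIN_EXP - DBL_MANT_DIG), so a one-digit literal should already
convert to it.  A sketch, assuming IEEE double with denormals enabled and a
compiler that rounds literals correctly (the function names are made up
here):

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Smallest positive denormalized double: only the lowest significand bit set. */
double dbl_denorm_min(void)
{
    return ldexp(1.0, DBL_MIN_EXP - DBL_MANT_DIG);  /* 2^-1074 for IEEE double */
}

void check_denorm_literal(void)
{
    /* One digit suffices: 5E-324 rounds to the same single-bit value. */
    assert(dbl_denorm_min() == 5E-324);
    assert(dbl_denorm_min() > 0.0);
}
```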

Caveat: The IEEE standard has strict requirements on decimal-to-binary
conversion for single and double, but even there it permits a little slack in
converting the _MAX and _MIN constants.  You're lucky if you have a
decimal-to-binary conversion routine which will convert the above
representations of LDBL_MAX and LDBL_MIN to the appropriate binary values.
You could even get overflow.  If there is a problem, it's more important that
the constants convert correctly than that they themselves be accurate.
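
One way to find out whether the local conversion routine is up to the job is
to feed it the decimal strings and compare the results against values built
directly in binary.  A sketch for double, assuming strtod() and ldexp() are
available and IEEE double is in use (converts_exactly is a hypothetical
helper); the analogous test for the quad constants would need long double
versions of both routines:

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdlib.h>

/*
 * Returns 1 if the whole decimal string converts to exactly `expected`.
 * An overflowed conversion yields HUGE_VAL and so fails the comparison.
 */
int converts_exactly(const char *decimal, double expected)
{
    char *end;
    double got = strtod(decimal, &end);
    return *end == '\0' && got == expected;
}

void check_double_limits(void)
{
    /* Reference values built in binary, with no decimal conversion involved. */
    double max = ldexp(1.0 - DBL_EPSILON / 2, DBL_MAX_EXP); /* (1 - 2^-53) * 2^1024 */
    double min = ldexp(0.5, DBL_MIN_EXP);                   /* 2^-1022 */

    assert(converts_exactly("1.7976931348623157e+308", max));
    assert(converts_exactly("2.2250738585072014e-308", min));
}
```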

Note that the C standard permits the macro names to be defined as expressions.
Here's an idea for what might work:
     #define    FLT_MAX         (ldexp(1-6E-8, FLT_MAX_EXP))
     #define    FLT_MIN         (ldexp(0.5, FLT_MIN_EXP))
     #define    DBL_MAX         (ldexp(1-1E-16, DBL_MAX_EXP))
     #define    DBL_MIN         (ldexp(0.5, DBL_MIN_EXP))
     #define    LDBL_MAX        (ldexpl(1-2E-34L,LDBL_MAX_EXP))
     #define    LDBL_MIN        (ldexpl(0.5L, LDBL_MIN_EXP))
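
A quick way to try this idea on a system that already has a trustworthy
<float.h> is to give the expressions private names and compare.  A sketch,
assuming IEEE single and double; the MY_ prefix is invented here to avoid
colliding with the real macros, and the (float) casts are an addition in
this sketch so that the float values are rounded to float before the
comparison:

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* The expression forms from above, under illustrative names. */
#define MY_FLT_MAX  ((float)ldexp(1 - 6E-8, FLT_MAX_EXP))
#define MY_FLT_MIN  ((float)ldexp(0.5, FLT_MIN_EXP))
#define MY_DBL_MAX  (ldexp(1 - 1E-16, DBL_MAX_EXP))
#define MY_DBL_MIN  (ldexp(0.5, DBL_MIN_EXP))

/* On an IEEE system with round-to-nearest, all four should agree. */
void check_expression_macros(void)
{
    assert(MY_FLT_MAX == FLT_MAX);
    assert(MY_FLT_MIN == FLT_MIN);
    assert(MY_DBL_MAX == DBL_MAX);
    assert(MY_DBL_MIN == DBL_MIN);
}
```

One trade-off to keep in mind: a function call is not a constant expression,
so macros defined this way cannot be used to initialize static objects.
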
--
Prescott K. Turner, Jr.
Software Development Technologies, Inc.
375 Dutton Rd., Sudbury, MA 01776 USA        (508) 443-5779
UUCP:...genrad!mrst!sdti!turner


