Need info on IEEE quad format
Prescott K. Turner
turner at sdti.UUCP
Tue Sep 20 02:11:36 AEST 1988
In article <660016 at hpclscu.HP.COM>, shankar at hpclscu.HP.COM (Shankar Unni)
writes:
>I need some info on IEEE floating point representation limits to construct
>a <float.h> file for ANSI C. I already have the info for single and double
>floats:
>...
Here are some improvements to your single and double values:
#define FLT_ROUNDS 1 /* is consistent with tie-breaking */
/* to nearest even significand */
#define DBL_MIN_EXP -1021 /* Because C uses a different */
#define DBL_MAX_EXP 1024 /* (inferior) model for floating */
#define FLT_MIN_EXP -125 /* point numbers from IEEE 754, its */
#define FLT_MAX_EXP 128 /* MIN_EXP and MAX_EXP values are */
/* different. */
#define DBL_MIN 2.2250738585072014e-308 /* more accurate */
#define DBL_MAX 1.7976931348623157e+308 /* than the latest */
/* draft C standard */
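The exponent difference comes from normalization: the C model writes a value as f * 2^e with 0.5 <= f < 1, while IEEE 754 uses 1 <= f < 2, so each C exponent is one greater than the IEEE one. A small sketch of that, assuming an IEEE 754 double and a library that provides frexp; the names c_model_exponent and check_exponent_model are made up for illustration:

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Exponent e such that x == f * 2^e with 0.5 <= |f| < 1 (the C model). */
int c_model_exponent(double x)
{
    int e;
    (void)frexp(x, &e);   /* frexp normalizes exactly as the C model does */
    return e;
}

/* The numeric values in the comments assume IEEE 754 double. */
void check_exponent_model(void)
{
    assert(c_model_exponent(DBL_MAX) == DBL_MAX_EXP);  /* 1024; IEEE exponent is 1023 */
    assert(c_model_exponent(DBL_MIN) == DBL_MIN_EXP);  /* -1021; IEEE exponent is -1022 */
    assert(c_model_exponent(1.0) == 1);                /* 1.0 == 0.5 * 2^1 */
}
```

Calling check_exponent_model() from a test program should pass silently on such a system.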
>The information I need is (for quad-precision (128-bit) floats):
The draft C standard does not provide the figures for IEEE quad-precision
because the IEEE 754 standard prescribes only lower limits for range and
precision of a 'double extended' format. I will attempt to fill in your
table, based on the quad format which appeared in an early draft of the IEEE
standard, and which is supported by Intel coprocessors.
#define LDBL_MANT_DIG 112 /* no hidden bit */
#define LDBL_EPSILON 3.851859888774471706111955885169855E-34L
#define LDBL_DIG 33
#define LDBL_MIN_EXP -16381
#define LDBL_MIN 3.362103143112093506262677817321753E-4932L
#define LDBL_MIN_10_EXP -4931
#define LDBL_MAX_EXP 16384
#define LDBL_MAX 1.1897314953572317650857593266280069E+4932L
#define LDBL_MAX_10_EXP 4932
>The magnitude of the smallest de-normalized quad would also be useful...
#define LDBL_DENORM_MIN 1E-4965L
No need for lots of digits here, because the smallest denormalized number has
only 1 bit of precision.
Caveat: The IEEE standard has strict requirements on decimal-to-binary
conversion for single and double, but even there it permits a little slack in
converting the _MAX and _MIN constants. You're lucky if you have a
decimal-to-binary conversion routine which will convert the above
representations of LDBL_MAX and LDBL_MIN to the appropriate binary values.
You could even get overflow. If there is a problem, it's more important that
the constants convert correctly than that they themselves be accurate.
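One way to find out whether your conversion routine copes is to build the intended binary value with ldexp and compare. The sketch below does this for double, since few libraries have a quad strtod to test against; it assumes IEEE 754 double and a correctly rounding strtod, and the helper name converts_exactly is hypothetical:

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

/* Returns 1 if the decimal string converts to exactly `expected`
   without overflowing, 0 otherwise. */
int converts_exactly(const char *decimal, double expected)
{
    double got = strtod(decimal, NULL);
    return got != HUGE_VAL && got == expected;
}

void check_decimal_limits(void)
{
    /* (1 - 2^-53) * 2^1024 and 0.5 * 2^-1021, built without decimal input. */
    double max = ldexp(1.0 - ldexp(1.0, -53), 1024);
    double min = ldexp(0.5, -1021);

    assert(converts_exactly("1.7976931348623157e+308", max));
    assert(converts_exactly("2.2250738585072014e-308", min));

    /* Two units too high in the last digit and the conversion overflows. */
    assert(!converts_exactly("1.7976931348623159e+308", max));
}
```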
Note that the C standard permits the macro names to be defined as expressions.
Here's an idea for what might work:
#define FLT_MAX (ldexp(1-6E-8, FLT_MAX_EXP))
#define FLT_MIN (ldexp(0.5, FLT_MIN_EXP))
#define DBL_MAX (ldexp(1-1E-16, DBL_MAX_EXP))
#define DBL_MIN (ldexp(0.5, DBL_MIN_EXP))
#define LDBL_MAX (ldexpl(1-2E-34L, LDBL_MAX_EXP))
#define LDBL_MIN (ldexpl(0.5L, LDBL_MIN_EXP))
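The float and double expressions can be checked against an existing <float.h> on an IEEE 754 system. In this sketch the X_ names are hypothetical, chosen so the real macros stay available for comparison, and the (float) casts are my addition so that the results have type float; it also assumes round-to-nearest conversion of the decimal constants:

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical stand-ins for the expression-style definitions. */
#define X_FLT_MAX ((float)ldexp(1 - 6E-8, FLT_MAX_EXP))
#define X_FLT_MIN ((float)ldexp(0.5, FLT_MIN_EXP))
#define X_DBL_MAX (ldexp(1 - 1E-16, DBL_MAX_EXP))
#define X_DBL_MIN (ldexp(0.5, DBL_MIN_EXP))

void check_expression_macros(void)
{
    /* 1 - 6E-8 is not exactly 1 - 2^-24, but it rounds to it as a float. */
    assert(X_FLT_MAX == FLT_MAX);
    assert(X_FLT_MIN == FLT_MIN);

    /* 1 - 1E-16 rounds to 1 - 2^-53, the largest double below 1. */
    assert(X_DBL_MAX == DBL_MAX);
    assert(X_DBL_MIN == DBL_MIN);
}
```

The long double pair is left out because the host's long double is unlikely to be the 112-bit format described here.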
--
Prescott K. Turner, Jr.
Software Development Technologies, Inc.
375 Dutton Rd., Sudbury, MA 01776 USA (508) 443-5779
UUCP:...genrad!mrst!sdti!turner