Floating Point Types (F.3.6)

Implementation Dependency - Floating Point Types (F.3.6)

When a floating-point constant is in the range of representable values for its type, the rounding of its scaled value is controlled at compile time by the -y compiler option and conforms to the IEEE standard for binary floating-point arithmetic. (3.1.3.1)

The following table shows the storage occupied and the range of various floating-point types (3.1.2.5):

Type	Size (bits)	Range of base 10 Exponents	Range of Decimal Values (in float.h)	Precision (decimal digits)
float	32	-37 to 38	1.175494351E-38 to 3.402823466E+38	7
double	64	-307 to 308	2.2250738585072014E-308 to 1.7976931348623157E+308	15
long double	64	-307 to 308	2.2250738585072014E-308 to 1.7976931348623157E+308	15
long double (-qldbl128 option)	128	-307 to 308	2.2250738585072014E-308 to 1.7976931348623157E+308	31

Other floating-point limits are set in the /usr/include/float.h header file, described in "Header Files Overview" in the AIX Version 4 Files Reference.

When an integral value is converted to a floating-point type that cannot represent it exactly, the result is rounded to the nearest higher or nearest lower representable value; the direction depends on the compile-time rounding mode set by the -y compiler option. (3.2.1.3)

When a floating-point number is converted to a narrower floating-point number, the direction of truncation or rounding depends on the rounding mode set by the -y compiler option. (3.2.1.4)

Using 16-byte long doubles (-qldbl128 Option)
The mathematical functions declared in the <math.h> header file, such as cosl, tanl, and fmodl, have been updated to work with 16-byte long double floating-point numbers.

The input/output functions declared in the <stdio.h> header file, such as printf, scanf, and vsprintf, have been updated to work with 16-byte long double floating-point numbers.

A new function, atold, which converts a string to a long double representation, has been added to the existing string-to-number functions strtod, strtol, and strtoul.

Alignment Rules
If the first member of a union or structure is a long double, the aggregate is aligned on a 128-bit boundary. Other aggregates and long double identifiers are aligned on a 32-bit boundary. If -qalign=natural is specified, all long doubles are aligned on a 128-bit boundary, regardless of their placement in a union or structure.

The 2-byte alignment rules remain unchanged. All identifiers and aggregates are aligned on a 16-bit boundary.

For bind-time type checking, a long double has type r16.

There are three floating types: float, double, and long double. The range of values of each type is a subrange of the values of the next type in the list.

When the compiler converts a value of floating type to integral type, the fractional part is discarded. If the integral part is too large to be represented by the target integral type, the value is converted to the maximum value of that integral type.

When a long double is demoted to double or float, if the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower value, depending on the IEEE rounding mode.

Implementation-Defined Behavior
Implementation Dependencies
align Compiler Option
ldbl128 Compiler Option
y Compiler Option