IA-64 Floating-Point Operations and the IEEE Standard for Binary Floating-Point Arithmetic

This paper examines the implementation of floating-point operations in the IA-64 architecture from the perspective of the IEEE Standard for Binary Floating-Point Arithmetic [1]. The floating-point data formats, operations, and special values are compared with the mandatory or recommended ones from the IEEE Standard, showing the potential gains in performance that result from specific choices. Two subsections are dedicated to the floating-point divide, remainder, and square root operations, which are implemented in software. It is shown how IEEE compliance was achieved using new IA-64 features such as fused multiply-add operations, predication, and multiple status fields for IEEE status flags. Derived integer operations (the integer divide and remainder) are also illustrated. IA-64 floating-point exceptions and traps are described, including the Software Assistance faults and traps that can lead to further IEEE-defined exceptions. The software extensions to the hardware needed to comply with the IEEE Standard’s recommendations in handling floating-point exceptions are specified. The special case of the Single Instruction Multiple Data (SIMD) instructions is described. Finally, a subsection is dedicated to speculation, a new feature in IA processors.

INTRODUCTION

The IA-64 floating-point architecture was designed with three objectives in mind. First, it was meant to allow high-performance computations. This was achieved through a number of architectural features. Pipelined floating-point units allow several operations to take place in parallel. Special instructions were added, such as fused floating-point multiply-add, or SIMD instructions, which allow the processing of two subsets of floating-point operands in parallel. Predication allows skipping operations without taking a branch. Speculation allows speculative execution chains whose results are committed only if needed.
In addition, a large floating-point register file (including a rotating subset) reduces the number of save/restore operations involving memory. The rotating subset of the floating-point register file enables software pipelining of loops, leading to significant gains in performance. Second, the architecture aims to provide high floating-point accuracy. For this, several floating-point data types were provided, and instructions new to the Intel architecture, such as the fused floating-point multiply-add, were introduced. Third, compliance with the IEEE Standard for Binary Floating-Point Arithmetic was sought. The environment that a numeric software programmer sees complies with the IEEE Standard and most of its recommendations as a combination of hardware and software, as explained further in this paper.

Floating-Point Numbers

Floating-point numbers are represented as a concatenation of a sign bit, an M-bit exponent field, and an N-bit significand field. In some floating-point formats, the most significant bit (integer bit) of the significand is not represented. Its assumed value is 1, except for denormal numbers, whose most significant bit of the significand is 0. Mathematically,

$f = \sigma \cdot s \cdot 2^{e}$

where $\sigma = \pm 1$, $s \in [1, 2)$, $s = 1 + k / 2^{N-1}$, $k \in \{0, 1, 2, \ldots, 2^{N-1} - 1\}$, $e \in [e_{min}, e_{max}] \cap \mathbb{Z}$ ($\mathbb{Z}$ is the set of integers), $e_{min} = -2^{M-1} + 2$, and $e_{max} = 2^{M-1} - 1$.

The IA-64 architecture provides 128 82-bit floating-point registers that can hold floating-point values in various formats, and which can be addressed in any order. Floating-point numbers can also be stored into or loaded from memory.

Intel Technology Journal Q4, 1999

IA-64 FORMATS, CONTROL, AND STATUS

Formats

Three floating-point formats described in the IEEE Standard are implemented as required: single precision (M=8, N=24), double precision (M=11, N=53), and double-extended precision (M=15, N=64).
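As a quick check of the representation f = σ ⋅ s ⋅ 2^e and of the (M, N) parameters just listed, the following Python sketch evaluates emin and emax from M and splits a double precision value into σ, k, and e. The helper names `exponent_range` and `decompose` are illustrative assumptions, not part of any IA-64 tool chain, and `decompose` assumes a normal, finite, nonzero double as input.

```python
import math


def exponent_range(M):
    """Return (emin, emax) for an M-bit exponent field."""
    emin = -(2 ** (M - 1)) + 2
    emax = 2 ** (M - 1) - 1
    return emin, emax


def decompose(x, N=53):
    """Split a normal, finite, nonzero double x into (sigma, k, e)
    such that x == sigma * (1 + k / 2**(N - 1)) * 2**e."""
    sigma = -1 if math.copysign(1.0, x) < 0 else 1
    m, exp = math.frexp(abs(x))   # abs(x) == m * 2**exp with m in [0.5, 1)
    s = 2.0 * m                   # rescale the significand into [1, 2)
    e = exp - 1
    k = int((s - 1.0) * 2 ** (N - 1))  # exact for double precision inputs
    return sigma, k, e


if __name__ == "__main__":
    for name, M in (("single", 8), ("double", 11), ("double-extended", 15)):
        print(name, exponent_range(M))
    print(decompose(-6.5))
```

For single precision this gives the familiar exponent range [-126, 127], and -6.5 decomposes as σ = -1, s = 1.625, e = 2.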
These are the formats usually accessible to a high-level language numeric programmer. The architecture provides for several more formats, listed in Table 1, that can be used by compilers or assembly code writers, some of which employ the 17-bit exponent range and 64-bit significands allowed by the floating-point register format.

Format                                             Format Parameters
Single precision                                   M=8,  N=24
Double precision                                   M=11, N=53
Double-extended precision                          M=15, N=64
Pair of single precision floating-point numbers    M=8,  N=24
IA-32 register stack single precision              M=15, N=24
IA-32 register stack double precision              M=15, N=53
IA-32 double-extended precision                    M=15, N=64
Full register file single precision                M=17, N=24
Full register file double precision                M=17, N=53
Full register file double-extended precision       M=17, N=64

Table 1: IA-64 floating-point formats

The floating-point format used in a given computation is determined by the floating-point instruction (some instructions have a precision control completer pc specifying a static precision) or by the precision control field (pc), and by the widest-range exponent (wre) bit in the Floating-Point Status Register (FPSR). In memory, floating-point numbers can only be stored in single precision, double precision, double-extended precision, and register file format (‘spilled’ as a 128-bit entity, containing the value of the floating-point register in the lower 82 bits).

Rounding

The four IEEE rounding modes are supported: rounding to nearest, rounding to negative infinity, rounding to positive infinity, and rounding to zero. Some instructions have the option of using a static rounding mode. For example, fcvt.fx.trunc performs conversion of a floating-point number to integer using rounding to zero.

Some of the basic operations specified by the IEEE Standard (divide, remainder, and square root) as well as other derived operations are implemented using sequences of add, subtract, multiply, or fused multiply-add and multiply-subtract operations.
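The effect of the four rounding modes listed above can be illustrated with a small software model. The Python sketch below rounds an exact rational number to an N-bit significand with an unbounded exponent range, using exact `Fraction` arithmetic. The function name `round_to_n_bits` and the mode labels are assumptions made for this illustration; the sketch models the arithmetic only and does not describe how the IA-64 hardware performs rounding.

```python
import math
from fractions import Fraction


def round_to_n_bits(x, N, mode):
    """Round the exact rational x to an N-bit significand (unbounded exponent).

    mode is one of "nearest" (ties to even), "down" (toward negative
    infinity), "up" (toward positive infinity), or "zero".
    """
    x = Fraction(x)
    if x == 0:
        return x
    a = abs(x)
    # Find e such that 2**e <= |x| < 2**(e + 1).
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** e > a:
        e -= 1
    # Spacing between consecutive N-bit numbers in this binade.
    quantum = Fraction(2) ** (e - N + 1)
    q = x / quantum                      # exact, generally not an integer
    lo, hi = math.floor(q), math.ceil(q)
    if mode == "down":
        n = lo
    elif mode == "up":
        n = hi
    elif mode == "zero":
        n = lo if x > 0 else hi
    elif mode == "nearest":
        if q - lo < hi - q:
            n = lo
        elif q - lo > hi - q:
            n = hi
        else:                            # exact value or tie: pick the even one
            n = lo if lo % 2 == 0 else hi
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return n * quantum


if __name__ == "__main__":
    tenth = Fraction(1, 10)
    for mode in ("nearest", "down", "up", "zero"):
        print(mode, round_to_n_bits(tenth, 24, mode))
```

With N = 24, the real number 1/10 rounds to 13421773 ⋅ 2^-27 in the nearest and positive-infinity modes, and to 13421772 ⋅ 2^-27 in the other two.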
In order to determine whether a given computation yields the correctly rounded result in any rounding mode, as specified by the standard, the error that occurs due to rounding has to be evaluated. Two measures are commonly used for this purpose. The first is the error of an approximation with respect to the exact result, expressed in fractions of an ulp, or unit in the last place. Let $F_N$ be the set of floating-point numbers with N-bit significands and unlimited exponent range. For the floating-point number $f = \sigma \cdot s \cdot 2^{e} \in F_N$, one ulp has the magnitude

$1\ \mathrm{ulp} = 2^{e-N+1}$

An alternative is to use the relative error. If the real number x is approximated by the floating-point number a, then the relative error ε is determined by