A mixed-precision fused multiply and add

The floating-point fused multiply and add (FMA), which computes R = AB + C with a single rounding, is now an IEEE 754 standard operator. This article investigates variants in which the addend C and the result R are in a larger format, for instance binary64 (double precision), while the multiplier inputs A and B are in a smaller format, for instance binary32 (single precision). Like the standard FMA operator, the proposed mixed-precision operator computes AB + C with a single rounding and fully supports subnormals. With minor modifications, it can also perform the standard FMA in the smaller format and the standard addition in the larger format.
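The single-rounding behaviour described above can be modelled in software for the binary32/binary64 case. The sketch below is only a reference model, not the hardware architecture the article proposes. It relies on the fact that the product of two binary32 values has at most 48 significant bits and an exponent well inside the binary64 range, so it is exact in binary64 and the final addition is the only rounding. The names `to_binary32`, `mixed_fma` and `exact_reference` are hypothetical helpers introduced for illustration, and round-to-nearest-even is assumed.

```python
import struct
from fractions import Fraction


def to_binary32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value.

    The result is returned as a Python float that holds a binary32 value
    exactly. Raises OverflowError if x is too large for binary32.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mixed_fma(a: float, b: float, c: float) -> float:
    """Model of R = A*B + C with A, B in binary32 and C, R in binary64."""
    a32, b32 = to_binary32(a), to_binary32(b)
    # Exact in binary64: at most 48 significant bits, and the exponent lies
    # between 2**-298 and 2**256, far from binary64 underflow or overflow.
    product = a32 * b32
    # The only rounding happens here, in binary64.
    return product + c


def exact_reference(a: float, b: float, c: float) -> float:
    """Same operation computed exactly in rationals, then rounded once."""
    a32, b32 = to_binary32(a), to_binary32(b)
    return float(Fraction(a32) * Fraction(b32) + Fraction(c))
```

As an illustration, with x = 1 + 2^-23 (exactly representable in binary32), `mixed_fma(x, x, -1.0)` keeps the 2^-46 term of the product, which would be lost if the product were first rounded to binary32.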
