Accurate and Efficient Floating Point Summation

We present and analyze several simple algorithms for accurately computing the sum of n floating point numbers using a wider accumulator. Let f and F be the number of significant bits in the summands and the accumulator, respectively. Then, assuming gradual underflow, no overflow, and round-to-nearest arithmetic, up to approximately 2^(F-f) numbers can be added accurately by simply summing the terms in decreasing order of exponents, yielding a sum correct to within about 1.5 units in the last place (ulps). We apply this result to the floating point formats in the IEEE floating point standard. For example, a dot product of single precision vectors of length at most 33, computed using double precision and sorting, is guaranteed correct to nearly 1.5 ulps. (Here the summands are the exact products of two 24-bit significands, so f = 48, and with F = 53 the bound 2^(F-f) is 32.) If double-extended precision is used (F = 64 in the common 80-bit format, giving 2^(F-f) = 65,536), the vector length can be as large as 65,537. We also investigate how the cost of sorting can be reduced or eliminated while retaining accuracy.
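The basic idea of summing in decreasing order of exponents with a wider accumulator can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `to_single`, `sorted_sum`, and `dot_single` are hypothetical, single precision inputs are emulated by rounding through `struct`, and Python's built-in float (an IEEE double) plays the role of the accumulator.

```python
import math
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sorted_sum(terms):
    """Add terms in decreasing order of exponent in a double accumulator."""
    # math.frexp(t)[1] is the binary exponent of t. Zeros report exponent 0,
    # which is harmless because adding zero never changes the accumulator.
    ordered = sorted(terms, key=lambda t: math.frexp(t)[1], reverse=True)
    acc = 0.0
    for t in ordered:
        acc += t
    return acc


def dot_single(xs, ys):
    """Dot product of single precision vectors, accumulated in double."""
    # A product of two 24-bit significands needs at most 48 bits, so each
    # product is exact in double precision (barring overflow or underflow).
    products = [to_single(x) * to_single(y) for x, y in zip(xs, ys)]
    return sorted_sum(products)


xs = [2.0 ** 60, 1.0, -(2.0 ** 60)]
ys = [1.0, 1.0, 1.0]

# Left-to-right accumulation loses the 1.0 when it is added to 2**60.
naive = 0.0
for x, y in zip(xs, ys):
    naive += to_single(x) * to_single(y)

sorted_result = dot_single(xs, ys)
print(naive, sorted_result)
```

In this small case the two large terms cancel first under the sorted order, so the remaining 1.0 survives, whereas the left-to-right loop absorbs it into 2^60 and returns zero. The sketch sorts the whole input, which is the cost the abstract says can be reduced or eliminated.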
