Paper Information - Accurate and Efficient Floating Point Summation

We present and analyze several simple algorithms for accurately computing the sum of n floating point numbers using a wider accumulator. Let f and F be the number of significant bits in the summands and the accumulator, respectively. Then assuming gradual underflow, no overflow, and round-to-nearest arithmetic, up to approximately 2^(F-f) numbers can be added accurately by simply summing the terms in decreasing order of exponents, yielding a sum correct to within about 1.5 units in the last place (ulps). We apply this result to the floating point formats in the IEEE floating point standard. For example, a dot product of single precision vectors of length at most 33 computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps. If double-extended precision is used, the vector length can be as large as 65,537. We also investigate how the cost of sorting can be reduced or eliminated while retaining accuracy.
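The basic idea in the abstract, sorting the terms by decreasing exponent and adding them in a wider accumulator, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `to_float32`, `sum_by_decreasing_exponent`, and `dot_single_in_double` are hypothetical, the accumulator is assumed to be Python's `float` (IEEE double, F = 53), and the summands are assumed to be products of single precision values (at most 48 significant bits, so each product is exact in double when no overflow or underflow occurs). The accuracy guarantee quoted in the abstract applies only within the stated length bound (33 for this single/double combination).

```python
import math
import struct


def to_float32(x):
    """Round a Python float to the nearest IEEE single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_by_decreasing_exponent(terms):
    """Add terms in a double accumulator, largest binary exponent first."""
    # math.frexp(t) returns (mantissa, exponent) with t == mantissa * 2**exponent.
    # Zeros report exponent 0; adding them is harmless wherever they land.
    ordered = sorted(terms, key=lambda t: math.frexp(t)[1], reverse=True)
    acc = 0.0  # Python floats are IEEE doubles: the "wider accumulator"
    for t in ordered:
        acc += t
    return acc


def dot_single_in_double(xs, ys):
    """Dot product of single precision vectors, accumulated in double."""
    # The product of two 24-bit significands needs at most 48 bits,
    # so each product is exact in double (assuming no overflow/underflow).
    products = [to_float32(x) * to_float32(y) for x, y in zip(xs, ys)]
    return sum_by_decreasing_exponent(products)
```

As a usage example, `sum_by_decreasing_exponent([1.0, 2.0**60, -2.0**60])` returns `1.0`, because the two large terms are added first and cancel exactly. Adding the same list left to right returns `0.0`, since the `1.0` is lost when it is added to `2.0**60`.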

James Demmel | Yozo Hida
