Numerical reproducibility for the parallel reduction on multi- and many-core architectures

A parallel algorithm to compute correctly-rounded floating-point sums.
Highly-optimized implementations for modern CPUs, GPUs and Xeon Phi.
As fast as memory bandwidth allows for large sums with moderate dynamic range.
Scales well with the problem size and resources used on a cluster of compute nodes.

On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and, therefore, non-reproducible, mainly due to the non-associativity of floating-point operations. We introduce an approach to compute the correctly rounded sums of large floating-point vectors accurately and efficiently, achieving deterministic results by construction. Our multi-level algorithm consists of two main stages: first, a filtering stage that relies on fast vectorized floating-point expansion; second, an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, Intel Xeon Phi co-processors, and both AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.
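The two-stage idea described above can be illustrated with a small sequential sketch in Python. It is not the authors' vectorized or parallel implementation: the names `two_sum`, `SuperAccumulator`, and `exact_sum` are hypothetical, the expansion size of 2 is an arbitrary choice, a Python integer stands in for the high-radix carry-save superaccumulator, and the inputs are assumed to be finite doubles whose partial sums do not overflow.

```python
from fractions import Fraction


def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) with s = fl(a + b) and a + b == s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


class SuperAccumulator:
    """Exact fixed-point accumulator (assumption: a Python int replaces the
    carry-save digit array mentioned in the abstract)."""

    SHIFT = 1074  # every finite double is an integer multiple of 2**-1074

    def __init__(self):
        self.total = 0

    def add(self, x):
        num, den = x.as_integer_ratio()  # exact; den is a power of two <= 2**1074
        self.total += num * ((1 << self.SHIFT) // den)

    def round(self):
        # Integer true division in Python is correctly rounded to the nearest double.
        return float(Fraction(self.total, 1 << self.SHIFT))


def exact_sum(values, expansion_size=2):
    """Sum doubles exactly, then round once, so the result is order-independent."""
    expansion = [0.0] * expansion_size  # filtering stage: small floating-point expansion
    acc = SuperAccumulator()            # accumulation stage: exact long accumulator
    for x in values:
        for i in range(expansion_size):
            expansion[i], x = two_sum(expansion[i], x)
            if x == 0.0:
                break  # the value was absorbed without rounding error
        else:
            acc.add(x)  # residue that the expansion could not absorb
    for e in expansion:
        acc.add(e)  # flush the expansion into the accumulator
    return acc.round()


if __name__ == "__main__":
    data = [1e100, 1.0, -1e100]
    print("naive left-to-right:", sum(data))
    print("exact_sum:          ", exact_sum(data))
    print("exact_sum, reversed:", exact_sum(data[::-1]))
```

Because every `two_sum` step is error-free and the accumulator is exact, the only rounding happens in the final conversion, which is why any permutation of the input yields the same bits. In this sketch most values with moderate dynamic range are absorbed by the expansion and never reach the more expensive accumulator, which mirrors the role the abstract assigns to the filtering stage.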
