Reproducible and Accurate Matrix Multiplication for GPU Accelerators

Due to the non-associativity of floating-point operations and dynamic scheduling on parallel architectures, obtaining a bitwise-reproducible floating-point result across multiple executions of the same code on different, or even similar, parallel architectures is challenging. In this paper, we address the problem of reproducibility in the context of matrix multiplication and propose an algorithm that yields both reproducible and accurate results. This algorithm is composed of two main stages: a filtering stage that uses fast vectorized floating-point expansions in conjunction with error-free transformations, and an accumulation stage based on Kulisch long accumulators in a high-radix carry-save representation. Finally, we provide implementations and performance results in parallel environments such as GPUs.
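
To make the abstract's terminology concrete, the sketch below shows in plain C the two standard error-free transformations (Knuth's TwoSum and an FMA-based TwoProd) from which floating-point expansions like those in the filtering stage are commonly built. It is an illustrative sketch of such primitives under usual round-to-nearest assumptions, not the paper's GPU implementation, and the Kulisch long-accumulator stage is not reproduced here.

```c
#include <math.h>
#include <stdio.h>

/* Error-free transformation of a sum (Knuth's TwoSum):
   s = fl(a + b) and e is the rounding error, so s + e == a + b exactly. */
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double bv = *s - a;
    *e = (a - (*s - bv)) + (b - bv);
}

/* Error-free transformation of a product using a fused multiply-add:
   p = fl(a * b) and e is the exact residual, so p + e == a * b exactly. */
static void two_prod(double a, double b, double *p, double *e) {
    *p = a * b;
    *e = fma(a, b, -*p);
}

int main(void) {
    double p, ep;
    two_prod(1.0 + 0x1p-30, 1.0 - 0x1p-30, &p, &ep);
    printf("product = %.17g, error term = %.17g\n", p, ep);

    double s, es;
    two_sum(1e16, 1.0, &s, &es);
    printf("sum     = %.17g, error term = %.17g\n", s, es);
    return 0;
}
```

In a scheme of this kind, each partial product of the matrix multiplication would be split with TwoProd and accumulated into a short expansion via repeated TwoSum calls, with terms that no longer fit the expansion flushed into the long accumulator; the exact expansion width and carry-save layout are design choices of the paper and are not shown here.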
