A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic

In recent years, hardware vendors have begun adding low-precision special function units to their designs in response to the machine learning community's demand for high compute power in low-precision formats. Server-line products increasingly feature such units as well: the NVIDIA tensor cores in ORNL's Summit supercomputer, for example, provide more than an order of magnitude higher performance than what is available in IEEE double precision. At the same time, the gap between compute power and memory bandwidth keeps widening, making data access and communication prohibitively expensive compared to arithmetic operations. To launch the multiprecision focus effort, we survey the numerical linear algebra community and summarize the existing multiprecision knowledge, expertise, and software capabilities in this landscape analysis report. We also include current efforts and preliminary results that may not yet be considered "mature technology" but have the potential to grow into production quality within the multiprecision focus effort. As we expect the reader to be familiar with the basics of numerical linear algebra, we refrain from providing detailed background on the algorithms themselves and focus instead on how mixed- and multiprecision technology can help improve the performance of these methods, presenting highlights of applications that significantly outperform traditional fixed-precision methods.
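A canonical example of the mixed-precision techniques surveyed is iterative refinement for linear systems: the expensive solve is performed in low precision, while residuals and corrections are accumulated in high precision to recover double-precision accuracy. The sketch below is illustrative only, not code from the survey; it assumes NumPy, uses float32 as the "low" and float64 as the "high" precision, and the function name `mixed_precision_refine` is hypothetical. Re-solving with the stored low-precision matrix stands in for reusing a low-precision factorization.

```python
import numpy as np

def mixed_precision_refine(A, b, max_iters=10, tol=1e-12):
    """Solve A x = b via mixed-precision iterative refinement.

    The solves happen in float32 (stand-in for a reused low-precision
    factorization); residuals and updates are computed in float64.
    """
    A32 = A.astype(np.float32)                       # low-precision copy
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b - A @ x                                # residual in double
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # correction solved in low precision, accumulated in double
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d
    return x

# Well-conditioned test problem: diagonally dominant random matrix
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true
x = mixed_precision_refine(A, b)
```

For well-conditioned systems this recovers a forward error near double-precision level despite performing every solve in single precision, which is the basic mechanism exploited by the half-precision tensor-core solvers discussed in the survey.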
