Data-driven Mixed Precision Sparse Matrix Vector Multiplication for GPUs

We optimize sparse matrix-vector multiplication (SpMV) using a mixed-precision strategy (MpSpMV) for NVIDIA V100 GPUs. The approach has three benefits: (1) it reduces computation time, (2) it shrinks the stored matrix and therefore reduces data movement, and (3) it exposes an opportunity for increased parallelism. MpSpMV's decision to lower a value to single precision is data-driven, based on the individual nonzero values of the sparse matrix. Across all real-valued matrices from the Sparse Matrix Collection, we obtain a maximum speedup of 2.61× and an average speedup of 1.06× over double precision, while maintaining higher accuracy than single precision.
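To make the data-driven split concrete, the sketch below partitions a CSR matrix by nonzero magnitude into a single-precision part and a double-precision part, then computes SpMV over both with double-precision accumulation. This is a minimal sequential C++ sketch under stated assumptions: the magnitude threshold, the two-partition layout, and every identifier here are illustrative, not the paper's GPU implementation.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Csr {                       // one CSR index structure per precision
    std::vector<int> rowPtr, col;
};
struct SplitMatrix {
    Csr lo; std::vector<float>  loVal;   // single-precision partition
    Csr hi; std::vector<double> hiVal;   // double-precision partition
};

// Assumed policy (illustrative): keep a nonzero in single precision when its
// magnitude lies in a range where float round-off is acceptable.
bool fitsSingle(double v) {
    double a = std::fabs(v);
    return a == 0.0 || (a >= 1e-10 && a <= 1e10);
}

// Split a double-precision CSR matrix into the two partitions, row by row.
SplitMatrix split(int n, const std::vector<int>& rowPtr,
                  const std::vector<int>& col, const std::vector<double>& val) {
    SplitMatrix m;
    m.lo.rowPtr.push_back(0); m.hi.rowPtr.push_back(0);
    for (int i = 0; i < n; ++i) {
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            if (fitsSingle(val[k])) {
                m.lo.col.push_back(col[k]);
                m.loVal.push_back(static_cast<float>(val[k]));
            } else {
                m.hi.col.push_back(col[k]);
                m.hiVal.push_back(val[k]);
            }
        }
        m.lo.rowPtr.push_back((int)m.lo.col.size());
        m.hi.rowPtr.push_back((int)m.hi.col.size());
    }
    return m;
}

// y = A*x over both partitions; products from the float partition are computed
// in single precision, then promoted to double before accumulation, so only
// the stored values lose precision.
void mpSpmv(int n, const SplitMatrix& m, const std::vector<double>& x,
            std::vector<double>& y) {
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = m.lo.rowPtr[i]; k < m.lo.rowPtr[i + 1]; ++k)
            sum += (double)(m.loVal[k] * (float)x[m.lo.col[k]]);
        for (int k = m.hi.rowPtr[i]; k < m.hi.rowPtr[i + 1]; ++k)
            sum += m.hiVal[k] * x[m.hi.col[k]];
        y[i] = sum;
    }
}

int main() {
    // 2x2 matrix [[1e12, 2.0], [0, 3.0]] in CSR form; 1e12 exceeds the assumed
    // single-precision-safe range, so it lands in the double partition.
    std::vector<int> rowPtr = {0, 2, 3}, col = {0, 1, 1};
    std::vector<double> val = {1e12, 2.0, 3.0};
    SplitMatrix m = split(2, rowPtr, col, val);
    std::vector<double> x = {1.0, 1.0}, y(2);
    mpSpmv(2, m, x, y);
    std::printf("y = [%g, %g]\n", y[0], y[1]);  // expect [1e12 + 2, 3]
}
```

Because the two partitions are independent, their SpMV passes can in principle run concurrently, which is one way a split like this exposes the additional parallelism noted above.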
