Autotuning Runtime Specialization for Sparse Matrix-Vector Multiplication

Runtime specialization optimizes programs using partial information that becomes available only at runtime. In this paper we apply autotuning to runtime specialization of Sparse Matrix-Vector Multiplication (SpMV) to predict the best specialization method among several candidates. In 91% to 96% of the predictions, either the best or the second-best method is chosen. The predicted methods achieve average speedups very close to those obtained when the best method is always used. By using an efficient code generator and a carefully designed set of matrix features, we show that the runtime costs can be amortized, yielding performance benefits in many real-world cases.
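To make the idea concrete, the following is a minimal sketch of runtime specialization for SpMV, not the paper's code generator or its specialization methods. It assumes a matrix in CSR form with finite values and generates straight-line Python code with that matrix's values and column indices baked in. The names `spmv_csr`, `specialize_unrolled`, and `choose_method`, as well as the `max_unrolled_nnz` threshold, are hypothetical; the threshold rule merely stands in for a trained predictor over matrix features.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Generic CSR SpMV, y = A @ x, with no knowledge of the matrix."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def specialize_unrolled(row_ptr, col_idx, vals):
    """Generate a function specialized to one matrix (hypothetical method).

    The matrix values and column indices become constants in the generated
    source, so the resulting function takes only the input vector x.
    Assumes all values are finite, so that repr() round-trips them.
    """
    n = len(row_ptr) - 1
    lines = ["def spmv(x):", f"    y = [0.0] * {n}"]
    for i in range(n):
        terms = [
            f"{vals[k]!r} * x[{col_idx[k]}]"
            for k in range(row_ptr[i], row_ptr[i + 1])
        ]
        if terms:  # empty rows keep their initial 0.0
            lines.append(f"    y[{i}] = " + " + ".join(terms))
    lines.append("    return y")
    namespace = {}
    exec("\n".join(lines), namespace)  # code generation happens at runtime
    return namespace["spmv"]


def choose_method(row_ptr, col_idx, vals, max_unrolled_nnz=64):
    """Pick a method from a single matrix feature (placeholder rule).

    A real autotuner would use a trained model over several features;
    the nonzero-count threshold here is an assumption for illustration.
    """
    nnz = row_ptr[-1]
    if nnz <= max_unrolled_nnz:
        return specialize_unrolled(row_ptr, col_idx, vals)
    return lambda x: spmv_csr(row_ptr, col_idx, vals, x)


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

spmv = choose_method(row_ptr, col_idx, vals)
print(spmv(x))  # [5.0, 6.0, 19.0]
```

The generation step is paid once per matrix, so a specialized function like this only pays off when the same matrix is multiplied many times, as in iterative solvers. That is the amortization trade-off the abstract refers to.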
