Paper information - Autotuning Runtime Specialization for Sparse Matrix-Vector Multiplication

Runtime specialization optimizes programs using partial information that becomes available only at runtime. In this paper we apply autotuning to the runtime specialization of Sparse Matrix-Vector Multiplication (SpMV) in order to predict the best specialization method among several candidates. In 91% to 96% of the predictions, either the best or the second-best method is chosen. The predicted methods achieve average speedups very close to those obtained when the best method is always used. By using an efficient code generator and a carefully designed set of matrix features, we show that the runtime costs of prediction and code generation can be amortized, yielding performance benefits in many real-world cases.
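To make the idea of runtime specialization and cost amortization concrete, the following is a minimal Python sketch, not the paper's code generator or any of its specialization methods. It assumes a matrix in CSR form and contrasts a generic SpMV loop with a function generated at runtime for one particular matrix. The names `spmv_csr`, `specialize_unrolled`, and `break_even_iterations` are hypothetical and chosen only for illustration.

```python
import math


def spmv_csr(row_ptr, col_idx, vals, x):
    """Generic CSR SpMV: y = A @ x, with A given by (row_ptr, col_idx, vals)."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def specialize_unrolled(row_ptr, col_idx, vals):
    """Generate, at runtime, a function hard-wired to one matrix.

    Illustrative assumption: every nonzero becomes a literal term in the
    generated source, so the specialized code has no loops or index arrays.
    """
    lines = ["def spmv_specialized(x):", "    return ["]
    for i in range(len(row_ptr) - 1):
        terms = [f"{vals[k]!r} * x[{col_idx[k]}]"
                 for k in range(row_ptr[i], row_ptr[i + 1])]
        lines.append("        " + (" + ".join(terms) if terms else "0.0") + ",")
    lines.append("    ]")
    namespace = {}
    exec("\n".join(lines), namespace)  # compile the generated source
    return namespace["spmv_specialized"]


def break_even_iterations(specialization_cost, generic_time, specialized_time):
    """Number of SpMV calls after which the one-time cost is recovered."""
    saving = generic_time - specialized_time
    if saving <= 0:
        return None  # specialization never pays off
    return math.ceil(specialization_cost / saving)


# The 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y_generic = spmv_csr(row_ptr, col_idx, vals, x)
spmv_fast = specialize_unrolled(row_ptr, col_idx, vals)
y_special = spmv_fast(x)
```

Both calls compute the same product for this matrix. With made-up timings, `break_even_iterations(100.0, 5.0, 3.0)` gives 50: a one-time cost of 100 time units is recovered after 50 multiplications that each save 2 units. This is the sense in which specialization costs are amortized in iterative solvers that multiply by the same matrix many times.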
