Optimization by runtime specialization for sparse matrix-vector multiplication

Runtime specialization optimizes programs based on partial information available only at run time. It is applicable when some input data is used repeatedly while other input data varies, and it has the potential to generate highly efficient code. In this paper, we explore the speedups obtainable for sparse matrix-dense vector multiplication using runtime specialization, in the case where a single matrix is multiplied by many vectors. We experiment with five methods involving runtime specialization, comparing them to methods that do not (including Intel's MKL library). In this work, we focus on evaluating the speedups that runtime specialization can achieve, without considering the overhead of code generation. Our experiments use 23 matrices from the Matrix Market and Florida collections and run on five different machines, giving 115 matrix-machine combinations. In 94 of those 115 cases, the specialized code runs faster than any version without specialization. If we only use specialization, the average speedup with respect to Intel's MKL library ranges from 1.44x to 1.77x, depending on the machine. We also find that the best method depends on the matrix and machine; no method is best for all matrices and machines.
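
To make the idea concrete, the following is a minimal sketch of specializing sparse matrix-vector multiplication to one fixed matrix. It is an illustration only: the abstract does not describe the five specialization methods, so the full-unrolling strategy, the CSR layout, and the names `spmv_csr` and `specialize_spmv` are all assumptions made here, not the paper's implementation. Python is used purely for readability; it does not reflect the performance characteristics measured in the paper.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Generic (unspecialized) CSR sparse matrix-vector product y = A @ x."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def specialize_spmv(row_ptr, col_idx, vals):
    """Generate, at run time, a function hard-wired to one matrix.

    Hypothetical strategy: fully unroll every row so that the matrix
    values and column indices become constants in the generated source.
    """
    lines = ["def spmv_specialized(x):", "    return ["]
    for i in range(len(row_ptr) - 1):
        terms = [
            f"{vals[k]!r} * x[{col_idx[k]}]"
            for k in range(row_ptr[i], row_ptr[i + 1])
        ]
        # An empty row contributes a literal zero.
        lines.append("        " + (" + ".join(terms) if terms else "0.0") + ",")
    lines.append("    ]")
    namespace = {}
    exec("\n".join(lines), namespace)  # compile the generated source
    return namespace["spmv_specialized"]


# The matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]

# Specialize once, then reuse the generated function for many vectors.
multiply = specialize_spmv(row_ptr, col_idx, vals)
for x in ([1.0, 2.0, 3.0], [0.0, 1.0, 0.0]):
    assert multiply(x) == spmv_csr(row_ptr, col_idx, vals, x)
```

The generated function contains no loops over `row_ptr` and no index lookups into `col_idx`, which is the kind of saving specialization aims for; the cost of generating and compiling the code is paid once per matrix, which is the overhead the abstract explicitly sets aside.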
