pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures

We have developed pOSKI, the Parallel Optimized Sparse Kernel Interface, an autotuning framework that optimizes sparse matrix-vector multiply (SpMV) performance on emerging shared-memory multicore architectures. Our autotuning methodology extends previous work in the scientific computing community that targeted serial architectures. In addition to previously explored parallel optimizations, we find that load-balanced data decomposition is extremely important to achieving good parallel performance on the new generation of parallel architectures. Our best parallel configurations perform up to 9x faster than optimized serial codes on the AMD Santa Rosa architecture, 11.3x faster on the AMD Barcelona architecture, and 7.2x faster on the Intel Clovertown architecture.
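To make the idea of load-balanced data decomposition concrete, the following is a minimal sketch, not pOSKI's actual implementation or API. It assumes the matrix is stored in compressed sparse row (CSR) form and splits the rows into contiguous blocks holding roughly equal numbers of nonzeros, rather than equal numbers of rows. The function names `partition_rows_by_nnz` and `spmv_block` are hypothetical, and the blocks are processed one after another here; in a parallel setting each block would be handed to its own thread.

```python
from bisect import bisect_left


def partition_rows_by_nnz(row_ptr, num_parts):
    """Return num_parts + 1 row boundaries for a CSR matrix.

    Rows are split into contiguous blocks so that each block holds
    roughly total_nnz / num_parts nonzeros (hypothetical helper,
    not part of any real library).
    """
    num_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, num_parts):
        target = total_nnz * p / num_parts
        # First row boundary whose cumulative nonzero count reaches the target.
        r = bisect_left(row_ptr, target)
        # Keep boundaries monotone and inside the matrix.
        r = max(bounds[-1], min(r, num_rows))
        bounds.append(r)
    bounds.append(num_rows)
    return bounds


def spmv_block(row_ptr, col_idx, values, x, y, row_start, row_end):
    """Compute y[i] = sum_k A[i, k] * x[k] for rows in [row_start, row_end)."""
    for i in range(row_start, row_end):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc


# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 1.0]
y = [0.0, 0.0, 0.0]

bounds = partition_rows_by_nnz(row_ptr, 2)
# Each (start, end) pair is an independent unit of work; a parallel
# version would assign one pair per thread.
for start, end in zip(bounds, bounds[1:]):
    spmv_block(row_ptr, col_idx, values, x, y, start, end)
```

Splitting on the cumulative nonzero count means a matrix with a few very dense rows does not leave one thread with most of the work, which an equal-rows split would do. Because whole rows are never divided, a single extremely dense row can still dominate its block.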
