Minimizing communication in sparse matrix solvers

Data communication, both within the memory system of a single processor node and between multiple nodes in a system, is the bottleneck in many iterative sparse matrix solvers such as CG and GMRES. In a conventional implementation, k iterations perform k sparse matrix-vector multiplications and Ω(k) vector operations such as dot products, so communication grows by a factor of Ω(k) in both the memory system and the network. By reorganizing the sparse matrix kernel to compute a set of matrix-vector products at once, and reorganizing the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and by reading the matrix A from DRAM to cache just once instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown achieves speedups of up to 4.3x over standard GMRES, without sacrificing convergence rate or numerical stability.
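
The sketch below (Python with NumPy/SciPy) is meant only to illustrate the restructuring described above: build a block of Krylov basis vectors with one kernel call and orthogonalize them with a single reduction, instead of interleaving one sparse matrix-vector product with dot products every iteration. The function names matrix_powers and s_step_basis, the step count s, and the use of NumPy's dense QR in place of a parallel Tall Skinny QR (TSQR) are illustrative assumptions, not the paper's implementation. In particular, the plain loop below still reads A s times; the communication-avoiding kernel instead blocks A so that each cache block or processor produces its rows of all s basis vectors from a single read of A.

import numpy as np
import scipy.sparse as sp

def matrix_powers(A, v, s):
    """Return V = [v, A v, A^2 v, ..., A^s v] as columns.

    A real communication-avoiding kernel would block A so each cache block
    (or processor, using ghost zones) computes its rows of all s + 1 vectors
    from one read of A; this plain loop only illustrates the interface.
    """
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def s_step_basis(A, v, s):
    """One 'outer' step of an s-step Krylov method: build the basis with one
    kernel call, then orthogonalize it all at once (one reduction) instead of
    s separate rounds of dot products."""
    V = matrix_powers(A, v / np.linalg.norm(v), s)
    Q, R = np.linalg.qr(V)   # stands in for TSQR on a parallel machine
    return Q, R

if __name__ == "__main__":
    n, s = 1000, 4
    # A random sparse test matrix; shifted by the identity to avoid a zero diagonal.
    A = sp.random(n, n, density=0.01, format="csr") + sp.eye(n)
    Q, R = s_step_basis(A, np.random.rand(n), s)
    print(np.allclose(Q.T @ Q, np.eye(s + 1)))   # basis is orthonormal

Note that the monomial basis [v, A v, ..., A^s v] used here is only for illustration; preserving convergence rate and numerical stability at larger s requires a better-conditioned basis (for example, a Newton basis) together with a stable block orthogonalization such as TSQR.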
