Avoiding communication in sparse matrix computations

The performance of sparse iterative solvers is typically limited by sparse matrix-vector multiplication, which is itself limited by memory system and network performance. As the gap between computation and communication speed continues to widen, these traditional sparse methods will suffer. In this paper we focus on an alternative building block for sparse iterative solvers, the "matrix powers kernel" [x, Ax, A^2x, ..., A^kx], and show that by organizing computations around this kernel we can achieve near-minimal communication costs. We consider communication very broadly, as both network communication in parallel code and memory hierarchy access in sequential code. In particular, we introduce a parallel algorithm whose number of messages (total latency cost) is independent of the power k, and a sequential algorithm that reduces both the number and volume of accesses to slow memory, so that both its latency and bandwidth costs are independent of k. This work is part of a larger project to develop "communication-avoiding Krylov subspace methods," which also addresses the numerical issues associated with these methods. Our algorithms work for general sparse matrices that "partition well." We introduce parallel performance models of matrices arising from 2D and 3D problems and show predicted speedups over a conventional algorithm of up to 7× on a petaflop-scale machine and up to 22× when the computation is spread across the grid. Analogous sequential performance models of the same problems predict speedups over a conventional algorithm of up to 10× for an out-of-core implementation and up to 2.5× when our ideas are used to reduce off-chip latency and bandwidth to DRAM. Finally, we validate the model with an out-of-core sequential implementation and measure a speedup of over 3×, close to the predicted speedup.
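For reference, the sketch below shows what the matrix powers kernel computes in its naive form: k successive sparse matrix-vector products on a CSR matrix, each of which streams A from slow memory (or exchanges ghost values) once. This is the conventional baseline the paper's communication-avoiding algorithms improve on, not the paper's algorithm itself; all names here are illustrative.

import numpy as np

def spmv_csr(n, rowptr, colind, vals, x):
    """y = A @ x for an n-by-n CSR matrix A given by (rowptr, colind, vals)."""
    y = np.zeros(n)
    for i in range(n):
        for jj in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[jj] * x[colind[jj]]
    return y

def matrix_powers_naive(n, rowptr, colind, vals, x, k):
    """Return [x, Ax, A^2 x, ..., A^k x].

    Naive version: A is read in full on each of the k steps, so the
    communication cost grows linearly with k (the paper's algorithms
    make it essentially independent of k for matrices that partition well).
    """
    vecs = [x]
    for _ in range(k):
        vecs.append(spmv_csr(n, rowptr, colind, vals, vecs[-1]))
    return vecs

# Tiny example: a 1D Poisson-like tridiagonal matrix, n = 5, k = 3.
n, k = 5, 3
rowptr, colind, vals = [0], [], []
for i in range(n):
    for j in (i - 1, i, i + 1):
        if 0 <= j < n:
            colind.append(j)
            vals.append(2.0 if i == j else -1.0)
    rowptr.append(len(colind))
V = matrix_powers_naive(n, rowptr, colind, vals, np.ones(n), k)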
