Avoiding Communication in Computing Krylov Subspaces

Our goal is to minimize the communication costs of Krylov Subspace Methods (KSMs) for solving either Ax = b or Ax = λx, when A is a large sparse matrix. By communication costs we mean both bandwidth and latency costs, either between processors on a parallel computer, or between levels of a memory hierarchy on a sequential computer. As the cost of communication is growing exponentially relative to computation on modern computers, reducing communication is becoming ever more important. It is possible to reduce communication costs to near their theoretical minima: on a parallel computer, this means latency costs are independent of the dimension k of the Krylov subspace, as opposed to growing proportionally to k in a conventional implementation; on a sequential computer, it means both latency and bandwidth costs are independent of k. Achieving this speedup requires a new algorithmic formulation of KSMs. In this paper we present just one part of this new formulation, namely computing a basis of the Krylov subspace spanned by [x, Ax, A^2 x, ..., A^k x]; other papers will present the rest of the KSM formulation. We present theory, performance models, and computational results: our parallel performance models of 2D and 3D model problems predict speedups over a conventional algorithm of up to 15x on a model Petaflop machine, and up to 22x on a model of the Grid. Our sequential performance models of the same model problems predict speedups over a conventional algorithm of up to 10x on an out-of-core implementation, and up to 2.5x on Intel Clovertown, where we use our ideas to reduce off-chip latency and bandwidth to DRAM. Finally, we measured a speedup of over 3x on an actual out-of-core implementation. We also consider the related kernel [x, Ax, MAx, AMAx, ..., A(MA)^{k-1} x, (MA)^k x], which arises in preconditioned KSMs. Under certain mathematical conditions on A and the preconditioner M, we show how to avoid latency and bandwidth for this kernel as well.
∗Computer Science Division and Mathematics Dept., University of California, Berkeley, CA 94720 (demmel@cs.berkeley.edu). †Computer Science Division, University of California, Berkeley, CA 94720 (mhoemmen@eecs.berkeley.edu). ‡EECS Department, University of California, Berkeley, CA 94720 (marghoob@eecs.berkeley.edu). §Computer Science Division, University of California, Berkeley, CA 94720 (yelick@cs.berkeley.edu).
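To make the communication cost of the conventional approach concrete, the following minimal sketch computes the monomial Krylov basis [x, Ax, A^2 x, ..., A^k x] one sparse matrix-vector product at a time, as a standard implementation would. Each of the k products depends on the previous one, so on a parallel machine each incurs a round of communication (latency proportional to k), and on a sequential machine A is re-read from slow memory k times; the communication-avoiding formulation in this paper computes all k+1 vectors while reading A only a constant number of times. A dense 1D Laplacian stands in for the sparse 2D/3D model problems; the function name `krylov_basis` is illustrative, not from the paper.

```python
import numpy as np

def krylov_basis(A, x, k):
    """Conventional computation of [x, A x, A^2 x, ..., A^k x]:
    k dependent matrix-vector products, hence one communication
    round (parallel) or one full pass over A (sequential) per step."""
    V = np.empty((len(x), k + 1))
    V[:, 0] = x
    for j in range(1, k + 1):
        V[:, j] = A @ V[:, j - 1]  # step j needs the result of step j-1
    return V

# Toy example: 1D Laplacian stencil, a small analogue of the model problems.
n, k = 5, 3
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.ones(n)
V = krylov_basis(A, x, k)  # columns span the Krylov subspace of dimension k+1
```

In practice A would be stored in a sparse format (e.g., CSR) and the basis would later be orthogonalized; this sketch only illustrates the serial dependence between the k products that the communication-avoiding kernel removes.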
