Communication-optimal iterative methods

Data movement, both within the memory system of a single processor node and between multiple nodes in a system, limits the performance of many Krylov subspace methods for solving sparse linear systems and eigenvalue problems. In algorithms such as CG, GMRES, Lanczos, and Arnoldi, s iterations perform s sparse matrix-vector multiplications and Θ(s) vector reductions, so both single-node and network communication grow as Θ(s). By reorganizing the sparse matrix kernel to compute a set of matrix-vector products at once, and reorganizing the rest of the algorithm accordingly, we can perform s iterations by sending O(log P) messages instead of Θ(s · log P) messages on a parallel machine, and by reading the on-node components of the matrix A from DRAM to cache just once on a single node instead of s times. This reduces communication to the minimum possible. We discuss both the algorithms and an implementation of GMRES on a single node of an 8-core Intel Clovertown. Our implementations achieve significant speedups over the conventional algorithms.
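The kernel at the heart of this reorganization computes the Krylov basis [x, Ax, A²x, ..., Aˢx] in a single invocation rather than as s separate sparse matrix-vector products. The sketch below shows only the mathematical specification of that kernel, not the communication-avoiding implementation (which blocks A and replicates "ghost" rows so each block of s products needs just one read of its part of A); the function name and test matrix are illustrative, not from the paper.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, s):
    """Return V with rows [x, A @ x, A @ A @ x, ..., A^s @ x].

    A conventional Krylov iteration would perform these s products one
    at a time, reading A from slow memory on each iteration; the
    communication-avoiding kernel produces the same vectors in one pass
    over A. This reference version keeps the naive loop for clarity.
    """
    V = np.empty((s + 1, x.shape[0]))
    V[0] = x
    for k in range(s):
        V[k + 1] = A @ V[k]  # one SpMV per basis vector
    return V

# Illustrative example: a 1-D Poisson (tridiagonal) matrix, a common
# test problem, with a starting vector of all ones.
A = sp.diags([1.0, -2.0, 1.0], offsets=[-1, 0, 1], shape=(6, 6), format="csr")
x = np.ones(6)
V = matrix_powers(A, x, 4)
```

The rest of the iterative method (orthogonalization, reductions) is then restructured to consume these s basis vectors per outer step, which is what collapses Θ(s · log P) messages into O(log P).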