Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

Trading communication for redundant computation can increase the silicon efficiency of FPGAs and GPUs when accelerating communication-bound sparse iterative solvers. While k iterations of the solver can be unrolled to provide an O(k) reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model, and the growth in redundant computation. This paper presents a systematic procedure for selecting the algorithmic parameter k, which governs the communication-computation tradeoff on hardware accelerators such as FPGAs and GPUs. We provide predictive models to understand this tradeoff and show how careful selection of k can yield performance improvements that would otherwise demand a significant increase in memory bandwidth. On an NVIDIA C2050 GPU, we demonstrate a 1.9×-42.6× speedup over standard iterative solvers across a range of benchmarks, and show that this speedup is limited by the growth in redundant computation. For FPGAs, in contrast, we present an architecture-aware algorithm that limits off-chip communication while permitting communication between the processing cores; this reduces redundant computation, allows a larger k, and hence yields higher speedups. With k chosen carefully for both architectures, our FPGA approach achieves a 0.3×-4.4× speedup over same-generation GPU devices across the same range of benchmarks.
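
To make the unrolling concrete, the C sketch below shows the kind of k-step "matrix powers" computation that results: instead of k separate solver iterations, the vectors Ax, A^2 x, ..., A^k x are produced in one pass. This is only an illustrative reference version under our own assumptions (the csr_t type and function names are ours, not from the paper); a communication-avoiding implementation would tile the outer loop so that each block of A is fetched from off-chip memory once and reused across all k steps, recomputing overlapping "ghost" entries redundantly at tile boundaries.

/* Hypothetical CSR matrix type (illustrative only). */
typedef struct {
    int n;              /* matrix dimension       */
    const int *rowptr;  /* n+1 row offsets        */
    const int *col;     /* column indices         */
    const double *val;  /* nonzero values         */
} csr_t;

/* One SpMV: y = A * x. A standard solver calls this k times,
 * streaming A and the vectors through off-chip memory on each call. */
static void spmv(const csr_t *A, const double *x, double *y) {
    for (int i = 0; i < A->n; ++i) {
        double acc = 0.0;
        for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; ++j)
            acc += A->val[j] * x[A->col[j]];
        y[i] = acc;
    }
}

/* k-step matrix powers kernel: V[j] holds A^(j+1) * x for j = 0..k-1.
 * This reference version only exposes the dependency structure that a
 * tiled, communication-avoiding variant would exploit on GPU or FPGA. */
void matrix_powers(const csr_t *A, const double *x, double **V, int k) {
    const double *prev = x;
    for (int j = 0; j < k; ++j) {
        spmv(A, prev, V[j]);
        prev = V[j];
    }
}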
