On the limits of GPU acceleration

This paper throws a small "wet blanket" on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. These computations--(a) iterative sparse linear solvers; (b) sparse Cholesky factorization; and (c) the fast multipole method--exhibit complex behavior and vary in computational intensity and memory reference irregularity. In each case, algorithmic analysis and prior work might lead us to conclude that an idealized GPU can deliver better performance, but we find that for at least equal-effort CPU tuning and consideration of realistic workloads and calling-contexts, we can with two modern quad-core CPU sockets roughly match one or two GPUs in performance. Our conclusions are not intended to dampen interest in GPU acceleration; on the contrary, they should do the opposite: they partially illuminate the boundary between CPU and GPU performance, and ask architects to consider application contexts in the design of future coupled on-die CPU/GPU processors.

[1]  Leslie Greengard,et al.  A fast algorithm for particle simulations , 1987 .

[2]  James Demmel,et al.  Applied Numerical Linear Algebra , 1997 .

[3]  Lexing Ying,et al.  A New Parallel Kernel-Independent Fast Multipole Method , 2003, ACM/IEEE SC 2003 Conference (SC'03).

[4]  D. Zorin,et al.  A kernel-independent adaptive fast multipole algorithm in two and three dimensions , 2004 .

[5]  Richard W. Vuduc,et al.  Sparsity: Optimization Framework for Sparse Matrix Kernels , 2004, Int. J. High Perform. Comput. Appl..

[6]  Katherine Yelick,et al.  OSKI: A library of automatically tuned sparse matrix kernels , 2005 .

[7]  Ramani Duraiswami,et al.  Fast multipole methods on graphics processors , 2008, J. Comput. Phys..

[8]  David Patterson,et al.  The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges , 2009 .

[9]  Richard W. Vuduc,et al.  Direct N-body Kernels for Multicore Platforms , 2009, 2009 International Conference on Parallel Processing.

[10]  Michael Garland,et al.  Implementing sparse matrix-vector multiplication on throughput-oriented processors , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[11]  Samuel Williams,et al.  Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2009, Parallel Comput..

[12]  Richard W. Vuduc,et al.  A massively parallel adaptive fast-multipole method on heterogeneous architectures , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[13]  Samuel Williams,et al.  Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[14]  Richard W. Vuduc,et al.  Model-driven autotuning of sparse matrix-vector multiply on GPUs , 2010, PPoPP '10.

[15]  Samuel Williams,et al.  Sparse Matrix-Vector Multiplication on Multicore and Accelerators , 2010 .

[16]  Murat Efe Guney,et al.  High-performance direct solution of finite element problems on multi-core processors , 2010 .

[17]  Eric Darve,et al.  Assembly of finite element methods on graphics processors , 2011 .