Tuning solution of large non-Hermitian linear systems on multiple graphics processing unit accelerated workstations

This work addresses the solution of large non-Hermitian linear systems on desktop workstations with multiple graphics processing units (GPUs). Although our implementation is motivated by the need to accelerate volume conductor modeling for bioelectrical brain imaging, the problem itself is common in scientific computing: numerically solving a complex-valued partial differential equation typically requires solving a sparse complex linear system that is usually non-Hermitian. For problem sizes in the millions, this can take a long time even with highly optimized CPU-based solvers. Our GPU-accelerated solver outperforms an optimized OpenMP-based reference running on two quad-core CPUs by up to 31× in single precision and up to 7× in double precision, at the cost of a very modest hardware upgrade of two dual-GPU GTX 295 graphics cards. A pair of more powerful Fermi GPUs (GTX 480) achieves speedups of 30× in single precision and 15× in double precision.
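
To make the problem class concrete, the following is a minimal CPU-only sketch of a Krylov iteration applied to a small sparse, complex, non-Hermitian system. It is not the GPU solver described above. The choice of BiCGStab, the tridiagonal test matrix, and all function names (`build_csr`, `spmv`, `bicgstab`) are assumptions made purely for illustration.

```python
import math

def build_csr(n):
    """Assemble a hypothetical tridiagonal complex test matrix in CSR form.
    Unequal off-diagonals (-1.2 below, -0.8 above) make it non-Hermitian."""
    values, col_idx, row_ptr = [], [], [0]
    for i in range(n):
        if i > 0:
            values.append(-1.2 + 0j)
            col_idx.append(i - 1)
        values.append(4.0 + 1.0j)
        col_idx.append(i)
        if i < n - 1:
            values.append(-0.8 + 0j)
            col_idx.append(i + 1)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def spmv(csr, x):
    """Sparse matrix-vector product y = A x for a CSR matrix."""
    values, col_idx, row_ptr = csr
    return [
        sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]

def dot(u, v):
    """Complex inner product, conjugating the first argument."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u).real)

def bicgstab(csr, b, tol=1e-10, max_iter=200):
    """Unpreconditioned BiCGStab; returns (x, iterations used)."""
    n = len(b)
    x = [0j] * n
    r = list(b)              # residual for the initial guess x = 0
    r_hat = list(r)          # fixed shadow residual
    rho = alpha = omega = 1 + 0j
    v = [0j] * n
    p = [0j] * n
    b_norm = norm(b)
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = spmv(csr, p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if norm(s) <= tol * b_norm:      # converged at the half step
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            return x, it
        t = spmv(csr, s)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if norm(r) <= tol * b_norm:
            return x, it
        rho = rho_new
    raise RuntimeError("BiCGStab did not converge")

if __name__ == "__main__":
    A = build_csr(64)
    rhs = [1 + 0j] * 64
    sol, iters = bicgstab(A, rhs)
    residual = norm([bi - yi for bi, yi in zip(rhs, spmv(A, sol))])
    print("iterations:", iters, "residual norm:", residual)
```

Each iteration costs two sparse matrix-vector products plus a handful of dot products and vector updates. Kernels of this kind are the natural candidates for GPU offloading in an iterative solver.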
