PDE solvers for hybrid CPU-GPU architectures

Michael Malahe: PDE Solvers for Hybrid CPU-GPU Architectures (Under the direction of Sorin Mitran)

Many problems of scientific and industrial interest are investigated by numerically solving partial differential equations (PDEs). For some of these problems, the scope of the investigation is limited by the cost of computational resources. A new approach to reducing these costs is the use of coprocessors, such as graphics processing units (GPUs) and Many Integrated Core (MIC) cards, which can execute floating point operations at a higher rate than a central processing unit (CPU) of the same cost. This is achieved by placing a large number of processors in a single device, at the price of very limited dedicated memory per thread.

Codes for a number of continuum methods, such as boundary element methods (BEM), finite element methods (FEM) and finite difference methods (FDM), have already been implemented on coprocessor architectures. These methods were designed before the adoption of coprocessor architectures, so implementing them efficiently with reduced thread-level memory can be challenging. Other methods do operate efficiently with limited thread-level memory, such as Monte Carlo methods (MCM) and lattice Boltzmann methods (LBM) for kinetic formulations of PDEs, but they are not competitive on CPUs and generally have poorer convergence than the continuum methods.

In this work, we introduce a class of methods in which the parallelism of kinetic formulations on GPUs is combined with the better convergence of continuum methods on CPUs. We first extend an existing Feynman-Kac formulation for determining the principal eigenpair of an elliptic operator to create a version that can retrieve arbitrarily many eigenpairs. This new method is implemented for multiple GPUs, and combined with a standard deflation preconditioner on multiple CPUs to create a hybrid concurrent method that converges faster than the deflation preconditioner alone. The hybrid method exhibits good parallelism, with an efficiency of 80% on a problem with 300 million unknowns, run on a configuration of 324 CPU cores and 54 GPUs.
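To illustrate why Monte Carlo formulations fit hardware with little memory per thread, the following is a minimal sketch of a stochastic solver for a much simpler problem than the eigenpair computation described above: Laplace's equation on the unit disk with Dirichlet boundary data. It is not the method of this work. It uses the walk-on-spheres technique to sample Brownian exit points, and the function names, the tolerance `eps`, and the test problem are all assumptions chosen for illustration. Each walker carries only its current position, and walkers never communicate, which is the property that maps well onto GPU threads.

```python
import math
import random


def walk_on_spheres(x, y, boundary_value, rng, eps=1e-3):
    """Follow one walker from (x, y) to the unit circle; return the boundary data there."""
    while True:
        r = math.hypot(x, y)
        d = 1.0 - r  # distance from the walker to the unit circle
        if d < eps:
            # Close enough to the boundary: project onto the circle and stop.
            return boundary_value(x / r, y / r)
        # Brownian motion exits the largest inscribed circle at a uniform angle,
        # so the walker can jump straight to a random point on that circle.
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += d * math.cos(theta)
        y += d * math.sin(theta)


def estimate_solution(x, y, boundary_value, n_walkers=20000, seed=0):
    """Estimate u(x, y) as the mean boundary value over independent walkers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walkers):
        total += walk_on_spheres(x, y, boundary_value, rng)
    return total / n_walkers


if __name__ == "__main__":
    # Hypothetical test problem: boundary data g(x, y) = x, whose harmonic
    # extension into the disk is u(x, y) = x.
    print(estimate_solution(0.3, 0.2, lambda bx, by: bx))
```

With boundary data g(x, y) = x the exact solution is u(x, y) = x, so the estimate at (0.3, 0.2) should be close to 0.3, with a statistical error that shrinks like the inverse square root of the number of walkers. That slow rate is the "poorer convergence" the abstract attributes to such methods, and it motivates pairing them with a continuum solver.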
