An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

Recently, an implicit, nonlinearly consistent, energy- and charge-conserving one-dimensional (1D) particle-in-cell method has been proposed for multi-scale, full-f kinetic simulations [G. Chen et al., J. Comput. Phys. 230 (18) (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver, which can use very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle-orbit computations from the field solver, while remaining fully self-consistent. This paper describes a very efficient, mixed-precision hybrid CPU-GPU implementation of the 1D implicit PIC algorithm that exploits this feature. The JFNK solver is kept on the CPU in double precision (DP), while the implicit, charge-conserving, and adaptive particle mover is implemented on a GPU (graphics processing unit) using CUDA in single precision (SP). Performance-oriented optimizations are introduced with the aid of the roofline model. The implicit particle mover algorithm is shown to achieve up to 400 GOp/s on an Nvidia GeForce GTX580. This corresponds to 25% absolute GPU efficiency against the peak theoretical performance, and is about 100 times faster than an equivalent compiler-optimized execution on a single CPU core (Intel Xeon X5460). For the test case chosen, the mixed-precision hybrid CPU-GPU solver is shown to outperform the DP CPU-only serial version by a factor of ~100, without apparent loss of robustness or accuracy in a challenging long-timescale ion acoustic wave simulation.
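The efficiency figure quoted above can be checked with a short roofline calculation. The following is a minimal sketch, not the authors' analysis: the peak throughput and memory bandwidth are assumed nominal GTX580 specifications (512 cores at 1.544 GHz with two operations per cycle, and 192.4 GB/s), and the function names `roofline` and `efficiency` are hypothetical helpers introduced only for illustration.

```python
# Roofline sketch: attainable throughput is bounded by
# min(peak compute, memory bandwidth * arithmetic intensity).
# Hardware numbers are ASSUMED nominal GTX580 specifications,
# not values taken from the paper.

PEAK_SP_GOPS = 1581.0  # assumed: 512 cores x 1.544 GHz x 2 ops per cycle
PEAK_BW_GBS = 192.4    # assumed nominal memory bandwidth, GB/s


def roofline(intensity_ops_per_byte, peak=PEAK_SP_GOPS, bandwidth=PEAK_BW_GBS):
    """Return the roofline bound in GOp/s for a given arithmetic intensity."""
    return min(peak, bandwidth * intensity_ops_per_byte)


def efficiency(achieved_gops, peak=PEAK_SP_GOPS):
    """Return the fraction of peak compute throughput that was achieved."""
    return achieved_gops / peak


if __name__ == "__main__":
    achieved = 400.0  # GOp/s, the particle-mover figure quoted in the abstract
    print(f"efficiency vs. assumed peak: {efficiency(achieved):.1%}")
    # Ridge point: the intensity above which a kernel becomes compute-bound.
    print(f"ridge point: {PEAK_SP_GOPS / PEAK_BW_GBS:.1f} ops/byte")
```

Under these assumed hardware numbers, 400 GOp/s is roughly a quarter of peak, consistent with the 25% figure stated in the abstract. A kernel needs an arithmetic intensity above the ridge point before the compute ceiling, rather than memory bandwidth, becomes the limit.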
