Paper information - An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

Recently, an implicit, nonlinearly consistent, energy- and charge-conserving one-dimensional (1D) particle-in-cell method has been proposed for multi-scale, full-f kinetic simulations [G. Chen et al., J. Comput. Phys. 230 (18) (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver, capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle-orbit computations from the field solver, while remaining fully self-consistent. This paper describes a very efficient, mixed-precision hybrid CPU-GPU implementation of the 1D implicit PIC algorithm that exploits this feature. The JFNK solver is kept on the CPU in double precision (DP), while the implicit, charge-conserving, and adaptive particle mover is implemented on a GPU (graphics processing unit) using CUDA in single precision (SP). Performance-oriented optimizations are introduced with the aid of the roofline model. The implicit particle mover algorithm is shown to achieve up to 400 GOp/s on an Nvidia GeForce GTX580. This corresponds to 25% absolute GPU efficiency relative to the theoretical peak performance, and is about 100 times faster than an equivalent compiler-optimized execution on a single CPU core (Intel Xeon X5460). For the test case chosen, the mixed-precision hybrid CPU-GPU solver is shown to outperform the DP CPU-only serial version by a factor of ~100, without apparent loss of robustness or accuracy in a challenging long-timescale ion acoustic wave simulation.
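The roofline model and the 25% efficiency figure quoted in the abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: the function name `roofline_gflops` is hypothetical, and the peak single-precision rate (about 1581 GFLOP/s) and memory bandwidth (about 192.4 GB/s) are assumed vendor figures for the GeForce GTX 580 that the abstract does not state. It also treats the abstract's GOp/s as comparable to GFLOP/s, which is an approximation.

```python
def roofline_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Attainable throughput under the roofline model.

    intensity is the arithmetic intensity in operations per byte moved
    to or from device memory. Performance is capped either by the
    compute peak or by bandwidth * intensity, whichever is lower.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


# Assumed vendor figures for the GTX 580 (not given in the abstract).
PEAK_SP_GFLOPS = 1581.0     # single-precision peak, GFLOP/s
MEM_BANDWIDTH_GBS = 192.4   # device memory bandwidth, GB/s

# Ridge point: the arithmetic intensity above which a kernel is compute-bound.
ridge_intensity = PEAK_SP_GFLOPS / MEM_BANDWIDTH_GBS

# Reported particle-mover throughput from the abstract.
REPORTED_GOPS = 400.0
efficiency = REPORTED_GOPS / PEAK_SP_GFLOPS

if __name__ == "__main__":
    print(f"ridge point: {ridge_intensity:.2f} op/byte")
    print(f"fraction of assumed SP peak: {efficiency:.1%}")
    for ai in (0.5, 2.0, 8.0, 32.0):
        print(f"intensity {ai:5.1f} op/byte -> {roofline_gflops(PEAK_SP_GFLOPS, MEM_BANDWIDTH_GBS, ai):7.1f} GFLOP/s")
```

Under these assumed figures, 400 GOp/s is roughly a quarter of the single-precision peak, which is consistent with the 25% efficiency the abstract reports. A kernel would need an arithmetic intensity of about 2 operations per byte or more for the bandwidth roof alone to permit that rate.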

[1] Viktor K. Decyk, et al. Adaptable Particle-in-Cell algorithms for graphical processing units, 2010, Comput. Phys. Commun.

[2] Samuel H. Fuller, et al. The Future of Computing Performance: Game Over or Next Level?, 2014.

[3] Kurt Keutzer, et al. The Concurrency Challenge, 2008, IEEE Design & Test of Computers.

[4] H. Burau, et al. PIConGPU: A Fully Relativistic Particle-in-Cell Code for a GPU Cluster, 2010, IEEE Transactions on Plasma Science.

[5] Rodney A. Kennedy, et al. Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices, 2007.

[6] Viktor K. Decyk, et al. A general concurrent algorithm for plasma particle-in-cell simulation codes, 1989.

[7] William Daughton, et al. Advances in petascale kinetic plasma simulation with VPIC and Roadrunner, 2009.

[8] James Demmel, et al. Benchmarking GPUs to tune dense linear algebra, 2008, HiPC 2008.

[9] Sally A. McKee, et al. Hitting the memory wall: implications of the obvious, 1995, CARN.

[10] Andreas Moshovos, et al. Demystifying GPU microarchitecture through microbenchmarking, 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[11] K. Bohmer. Defect Correction Methods: Theory and Applications, 1984.

[12] C. Birdsall, et al. Plasma Physics via Computer Simulation, 2018.

[13] Samuel Williams, et al. Roofline: an insightful visual performance model for multicore architectures, 2009, CACM.

[14] J. D. Littler, et al. A Proof of the Queuing Formula, 1961.

[15] Nail A. Gumerov, et al. Fast parallel Particle-To-Grid interpolation for plasma PIC simulations on the GPU, 2008, J. Parallel Distributed Comput.

[16] Emmett Kilgariff, et al. Fermi GF100 GPU Architecture, 2011, IEEE Micro.

[17] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.

[18] Kevin J. Bowers, et al. Accelerating a particle-in-cell simulation using a hybrid counting sort, 2001.

[19] Samuel Williams, et al. Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[20] Julien Langou, et al. Exploiting Mixed Precision Floating Point Hardware in Scientific Computations, 2006, High Performance Computing Workshop.

[21] Luis Chacón, et al. An energy- and charge-conserving, implicit, electrostatic particle-in-cell algorithm, 2011, J. Comput. Phys.

[22] William J. Dally, et al. The GPU Computing Era, 2010, IEEE Micro.

[23] Jie Cheng, et al. Programming Massively Parallel Processors. A Hands-on Approach, 2010, Scalable Comput. Pract. Exp.

[24] L. Shampine. Error estimation and control for ODEs, 2005.

[25] Samuel Williams, et al. Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[26] Peter W. Markstein, et al. IA-64 and elementary functions - speed and precision, 2000.

[27] Michael C. Huang, et al. Particle-in-cell simulations with charge-conserving current deposition on graphic processing units, 2010, J. Comput. Phys.

[28] Samuel Williams, et al. Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms, 2011, Parallel Comput.

[29] David A. Patterson, et al. Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design), 2008.

[30] Mark J. Harris, et al. Parallel Prefix Sum (Scan) with CUDA, 2011.

[31] Benjamin Bergen, et al. 0.374 Pflop/s trillion-particle kinetic modeling of laser plasma interaction on roadrunner, 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[32] Norbert Luttenberger, et al. A Novel Sorting Algorithm for Many-core Architectures Based on Adaptive Bitonic Sort, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[33] J. Little. A Proof for the Queuing Formula: L = λW, 1961.

[34] Wen-mei W. Hwu, et al. Compute Unified Device Architecture Application Suitability, 2009, Computing in Science & Engineering.