Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors

We present multicore parallelization strategies for the particle-to-grid interpolation step in the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell (PIC) application used to study turbulent transport in magnetic-confinement fusion devices. Particle-to-grid interpolation is a known performance bottleneck in several PIC applications. In GTC, this step involves particles depositing charge onto a 3D toroidal mesh. Multiple particles may contribute to the charge at the same grid point, so concurrent updates to the grid can conflict. We design new parallel algorithms for the GTC charge deposition kernel and analyze their performance on three leading multicore platforms. We implement thirteen variants of this kernel and identify the best-performing ones for typical PIC parameters such as the grid size, the number of particles per cell, and the GTC-specific variation in particle Larmor radius. We find that our best strategies can be 2x faster than the reference optimized MPI implementation, and our analysis provides insight into architectural features that are desirable for high-performance PIC simulation codes.
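
To illustrate why charge deposition is hard to parallelize, the following is a minimal sketch and not GTC's actual kernel. It assumes a 1D periodic grid with linear (cloud-in-cell) weighting in place of GTC's 3D toroidal mesh and gyro-averaged deposition. The function names `deposit` and `deposit_replicated`, and the round-robin split of particles across workers, are hypothetical choices made for this example. The second function shows one common way to avoid conflicting writes, giving each worker a private copy of the grid and summing the copies afterwards; it is not necessarily one of the thirteen variants studied here.

```python
import math


def deposit(positions, charges, ngrid, dx=1.0):
    """Scatter particle charge onto a 1D periodic grid with linear weighting.

    Each particle adds to its two nearest grid points, so several particles
    can update the same grid entry. This is the write conflict a parallel
    version has to handle.
    """
    grid = [0.0] * ngrid
    for x, q in zip(positions, charges):
        s = x / dx
        i = math.floor(s)        # index of the grid point to the left
        w = s - i                # fractional distance from that point
        grid[i % ngrid] += q * (1.0 - w)
        grid[(i + 1) % ngrid] += q * w
    return grid


def deposit_replicated(positions, charges, ngrid, nworkers, dx=1.0):
    """Emulate grid replication: one private grid per worker, then a reduction.

    The workers are run one after another here; in a threaded code each
    private grid would be filled concurrently without synchronization.
    """
    partial = [
        deposit(positions[w::nworkers], charges[w::nworkers], ngrid, dx)
        for w in range(nworkers)
    ]
    # Reduction step: sum the private copies point by point.
    return [sum(column) for column in zip(*partial)]


if __name__ == "__main__":
    pos = [0.25, 0.75, 3.5, 0.5]
    chg = [1.0, 1.0, 1.0, 1.0]
    print(deposit(pos, chg, ngrid=4))
    print(deposit_replicated(pos, chg, ngrid=4, nworkers=2))
```

Replication removes the need for atomic updates, but it multiplies the grid's memory footprint by the number of workers, and the reduction adds work proportional to the grid size. A memory-efficient scheme has to weigh that cost against the cost of synchronizing updates to a shared grid.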
