Overlapping communications in gyrokinetic codes on accelerator-based platforms

Communication and computation overlapping techniques have been introduced into the five-dimensional gyrokinetic codes GYSELA and GKV. To anticipate some of the exascale requirements, these codes were ported to modern accelerators, the Xeon Phi KNL and the Tesla P100 GPU. On these accelerators, the serial versions of GYSELA on a KNL and of GKV on a P100 are respectively 1.3× and 7.4× faster than on a single-socket Skylake processor. Regarding scalability, GYSELA performance was measured on 16 to 512 KNLs (1,024 to 32,768 cores) and GKV performance on 32 to 256 P100 GPUs. In the parallel versions, the transpose communication in the semi-Lagrangian solver of GYSELA and in the convolution kernel of GKV turned out to be the main bottleneck, which indicates that network constraints will be critical at exascale. To mitigate the communication costs, pipeline-based and task-based overlapping techniques have been implemented in both codes. With pipelining, the GYSELA 2D advection solver achieves a 33% to 92% speedup and the GKV 2D convolution kernel a 2× speedup. The task-based approach yields an 11% to 82% performance gain in the derivative computation of the electrostatic potential in GYSELA. We show that the pipeline-based approach is applicable in the presence of symmetry, whereas the task-based approach extends to more general situations.
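The pipeline-based overlap splits a large transpose (all-to-all) into chunks, so that the computation on one chunk hides the communication of the following ones. Below is a minimal sketch in C using MPI-3 non-blocking collectives; the chunk count NCHUNK, the slab size SLAB, and the compute_slab() stand-in are illustrative assumptions, not the actual GYSELA or GKV kernels.

```c
#include <mpi.h>
#include <stdlib.h>

#define NCHUNK 4      /* number of pipeline stages (illustrative) */
#define SLAB   1024   /* elements exchanged per rank per stage (illustrative) */

/* Stand-in for the per-chunk work (advection / convolution). */
static void compute_slab(double *buf, int n)
{
    for (int i = 0; i < n; ++i)
        buf[i] = 2.0 * buf[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double *send = malloc((size_t)NCHUNK * nranks * SLAB * sizeof *send);
    double *recv = malloc((size_t)NCHUNK * nranks * SLAB * sizeof *recv);
    for (int i = 0; i < NCHUNK * nranks * SLAB; ++i)
        send[i] = 1.0;

    MPI_Request req[NCHUNK];

    /* Post the non-blocking all-to-all of every chunk up front... */
    for (int c = 0; c < NCHUNK; ++c)
        MPI_Ialltoall(send + (size_t)c * nranks * SLAB, SLAB, MPI_DOUBLE,
                      recv + (size_t)c * nranks * SLAB, SLAB, MPI_DOUBLE,
                      MPI_COMM_WORLD, &req[c]);

    /* ...then compute on chunk c while chunks c+1.. are still in flight. */
    for (int c = 0; c < NCHUNK; ++c) {
        MPI_Wait(&req[c], MPI_STATUS_IGNORE);
        compute_slab(recv + (size_t)c * nranks * SLAB, nranks * SLAB);
    }

    free(send);
    free(recv);
    MPI_Finalize();
    return 0;
}
```

Posting all MPI_Ialltoall requests up front lets the MPI library progress the exchange of chunk c+1 while compute_slab() works on chunk c, which is the essence of the pipelining described above; the speedup saturates once the per-chunk computation is long enough to hide the per-chunk communication.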

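The task-based overlap instead expresses the per-chunk communication and computation as tasks with data dependences and leaves the interleaving to a runtime scheduler, which is why it tolerates patterns without the symmetry that pipelining requires. The following sketch uses OpenMP task dependences with MPI point-to-point messages; the ring exchange, the chunking, and the derivative() stencil are hypothetical stand-ins, not the GYSELA implementation.

```c
#include <mpi.h>

#define NCHUNK 4      /* independent chunks of the field (illustrative) */
#define N      1024   /* points per chunk (illustrative) */

/* Stand-in for the derivative of the electrostatic potential. */
static void derivative(const double *f, double *df, int n)
{
    for (int i = 1; i < n - 1; ++i)
        df[i] = 0.5 * (f[i + 1] - f[i - 1]);
    df[0] = df[n - 1] = 0.0;
}

int main(int argc, char **argv)
{
    /* Tasks issue MPI calls concurrently, so full thread support is needed. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    int right = (rank + 1) % nranks;
    int left  = (rank + nranks - 1) % nranks;

    static double f[NCHUNK][N], df[NCHUNK][N];
    for (int c = 0; c < NCHUNK; ++c)
        for (int i = 0; i < N; ++i)
            f[c][i] = (double)(rank + i);

    #pragma omp parallel
    #pragma omp single
    for (int c = 0; c < NCHUNK; ++c) {
        /* Communication task: illustrative ring exchange of chunk c. */
        #pragma omp task depend(inout: f[c])
        MPI_Sendrecv_replace(f[c], N, MPI_DOUBLE, right, c, left, c,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Computation task: starts as soon as chunk c has arrived,
         * overlapping with the exchanges of the remaining chunks. */
        #pragma omp task depend(in: f[c]) depend(out: df[c])
        derivative(f[c], df[c], N);
    }
    /* The implicit barrier of the parallel region waits for all tasks. */

    MPI_Finalize();
    return 0;
}
```

No explicit pipeline order is imposed here: the scheduler is free to interleave chunks in whatever order their dependences allow, which matches the observation above that the task-based approach applies to more general, less symmetric situations than the pipeline-based one.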