26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight

Stencil computation arises in a broad set of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Because stencil kernels are memory bound, they are challenging to optimize on modern supercomputers, which offer relatively high computing throughput but relatively low data-movement capability. This work demonstrates in detail the algorithms, implementations and optimizations of a real-world stencil computation in 3D nonhydrostatic atmospheric modeling on the newly announced Sunway TaihuLight supercomputer. At the algorithm level, we present a computation-communication overlapping technique to reduce the inter-process communication overhead, a locality-aware blocking method to fully exploit on-chip parallelism with enhanced data locality, and a collaborative data accessing scheme for sharing data among different threads. In addition, a variety of effective hardware-specific implementation and optimization strategies at both the process and thread levels, from fine-grained data management to data layout transformation, are developed to further improve the performance. Our experiments demonstrate that a single-process many-core speedup of as high as 170x can be achieved with the proposed algorithm and optimization strategies. In strong-scaling tests, the code scales well to millions of cores. In weak-scaling tests, the code scales nearly ideally to the full system of more than 10 million cores, sustaining 25.96 PFLOPS in double precision, which is 20% of the peak performance.
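To make the idea of blocking for data locality concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the 7-point averaging stencil, the grid contents, the tile sizes `bx` and `by`, and all function names are assumptions chosen for illustration. It shows only that sweeping the grid tile by tile visits the same points and produces the same result as a plain triple loop, while keeping each tile's working set small.

```python
def stencil_point(u, i, j, k):
    """Average of a cell and its six face neighbours (illustrative 7-point stencil)."""
    return (u[i][j][k]
            + u[i - 1][j][k] + u[i + 1][j][k]
            + u[i][j - 1][k] + u[i][j + 1][k]
            + u[i][j][k - 1] + u[i][j][k + 1]) / 7.0


def make_grid(nx, ny, nz):
    """Build a small deterministic 3D grid as nested lists (hypothetical test data)."""
    return [[[float((i * 31 + j * 17 + k * 7) % 13) for k in range(nz)]
             for j in range(ny)] for i in range(nx)]


def copy_grid(u):
    """Deep-copy a nested-list grid so boundary values carry over unchanged."""
    return [[row[:] for row in plane] for plane in u]


def sweep_naive(u):
    """One stencil sweep over all interior points with a plain triple loop."""
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    out = copy_grid(u)
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                out[i][j][k] = stencil_point(u, i, j, k)
    return out


def sweep_blocked(u, bx=4, by=4):
    """Same sweep, but the i-j plane is split into bx-by-by tiles.

    Each tile is finished before the next one starts, so the data a tile
    touches can stay in fast memory. The tile sizes are assumed values.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    out = copy_grid(u)
    for i0 in range(1, nx - 1, bx):
        for j0 in range(1, ny - 1, by):
            # min() clips the last tile in each direction to the interior.
            for i in range(i0, min(i0 + bx, nx - 1)):
                for j in range(j0, min(j0 + by, ny - 1)):
                    for k in range(1, nz - 1):
                        out[i][j][k] = stencil_point(u, i, j, k)
    return out


grid = make_grid(10, 9, 8)
reference = sweep_naive(grid)
blocked = sweep_blocked(grid, bx=4, by=3)
```

Both sweeps read only from the input grid and write to a separate output, so the tile order does not affect the result. On real hardware the tile sizes would be chosen to fit the available on-chip memory, which this sketch does not model.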
