Extreme-Scale Realistic Stencil Computations on Sunway TaihuLight with Ten Million Cores

Stencil computations arise in a wide variety of scientific and engineering applications and often play a critical role in the performance of extreme-scale simulations. Because stencil kernels are memory-bound, optimizing them is challenging on many leadership supercomputers, such as Sunway TaihuLight, that combine relatively high computing throughput with relatively low data-movement capability. In this white paper, we describe our efforts over the past two years to develop end-to-end implementation and optimization techniques for extreme-scale stencil computations on Sunway TaihuLight. We started by optimizing the 3-D 2nd-order 13-point stencil for nonhydrostatic atmospheric dynamics simulation, an important part of the work that won the 2016 ACM Gordon Bell Prize, and extended it to handle a broader range of realistic and challenging problems, such as the HPGMG benchmark, which consists of memory-hungry stencils, and gaseous wave detonation simulation, which relies on complex high-order stencils. The presented stencil computation paradigm on Sunway TaihuLight includes not only multilevel parallelization that exploits parallelism at different hardware levels, but also systematic performance optimization techniques for communication, memory access, and computation. Extreme-scale tests show that the proposed paradigm delivers remarkable performance on Sunway TaihuLight with ten million heterogeneous cores. In particular, we achieve an aggregate performance of 23.12 Pflops for the 3-D 5th-order WENO stencil computation in gaseous wave detonation simulation, which is, as far as we know, the highest performance reported for high-order stencil computations, and an aggregate rate of over one trillion unknowns solved per second in the HPGMG benchmark, which ranked first on the HPGMG List of Nov 2017.
