Extreme-Scale High-Order WENO Simulations of 3-D Detonation Wave with 10 Million Cores

High-order stencil computations, common in many applications, pose severe challenges on emerging many-core platforms because of complex hardware architectures and sophisticated computation and data-movement patterns. In this article, we tackle the challenges of high-order weighted essentially non-oscillatory (WENO) computations in extreme-scale simulations of 3D gaseous waves on Sunway TaihuLight. We design efficient parallelization algorithms and present effective optimization techniques that fully exploit the available parallelism while reducing memory footprint, enhancing data reuse, and balancing the computational load. Test results show that the optimized code scales to 9.98 million cores, solving 12.74 trillion unknowns at 23.12 Pflops in double precision.
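To make the term "high-order WENO stencil" concrete, the following is a minimal sketch of a single reconstruction step. It assumes the classic fifth-order formulation of Jiang and Shu; the abstract does not state which order or variant the optimized code uses, and the function name `weno5_left` and the parameter `eps` are illustrative choices, not taken from the article.

```python
def weno5_left(v0, v1, v2, v3, v4, eps=1e-6):
    """Reconstruct the value at the interface x_{i+1/2} from five cell
    averages v0..v4 = f[i-2], f[i-1], f[i], f[i+1], f[i+2].

    Assumption: classic fifth-order WENO (Jiang-Shu), left-biased stencil.
    """
    # Three third-order candidate reconstructions, one per sub-stencil.
    p0 = (2.0 * v0 - 7.0 * v1 + 11.0 * v2) / 6.0
    p1 = (-v1 + 5.0 * v2 + 2.0 * v3) / 6.0
    p2 = (2.0 * v2 + 5.0 * v3 - v4) / 6.0

    # Smoothness indicators: large where a sub-stencil crosses a discontinuity.
    b0 = 13.0 / 12.0 * (v0 - 2.0 * v1 + v2) ** 2 + 0.25 * (v0 - 4.0 * v1 + 3.0 * v2) ** 2
    b1 = 13.0 / 12.0 * (v1 - 2.0 * v2 + v3) ** 2 + 0.25 * (v1 - v3) ** 2
    b2 = 13.0 / 12.0 * (v2 - 2.0 * v3 + v4) ** 2 + 0.25 * (3.0 * v2 - 4.0 * v3 + v4) ** 2

    # Nonlinear weights built from the linear weights 0.1, 0.6, 0.3.
    a0 = 0.1 / (eps + b0) ** 2
    a1 = 0.6 / (eps + b1) ** 2
    a2 = 0.3 / (eps + b2) ** 2
    total = a0 + a1 + a2

    # Convex combination of the candidates.
    return (a0 * p0 + a1 * p1 + a2 * p2) / total
```

Each interface value reads five neighbouring cells per direction and evaluates several nonlinear weights, which illustrates why such stencils combine a wide data-access footprint with a high arithmetic cost per grid point. For smooth data the weights fall back to the linear ones; near a jump, the sub-stencils that cross it are effectively switched off.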
