Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight

Benchmarks for supercomputers are important tools, not only for evaluating and ranking modern supercomputers, but also for providing hints for future architecture design. As a new benchmark, HPGMG (high performance geometric multigrid) solves a linear equation set with a full geometric multi-grid algorithm. It involves computation on different scales, data movement with various volumes, global communication and neighbor communication with both large and small messages, etc., and is more correlated to real world applications than traditional benchmarks such as LINPACK. Therefore, it is desirable to examine how well HPGMG can perform on leadership supercomputers such as Sunway Taihulight. Sunway Taihulight, the No. 1 supercomputer in the Top 500 list from June 2016 to June 2018, which uses a specially designed many-core architecture SW26010, is of great interest to the community of high performance computing. With careful analysis and code design, we came up with an efficient implementation of HPGMG on SW26010 processors. We not only employed traditional optimization techniques such as 2.5D partitioning, double buffering, and collective data load, but also introduced a micro-benchmark to help with the choice of optimization direction and parameter tuning. Another contribution is that we proposed a new procedure for the major operations, by granulating and reordering the smooth function and the ghost exchange operation, leading to reduced memory copy and accelerated communication process. Our optimized implementation of HPGMG on Sunway TaihuLight achieved a ground-breaking performance of $$1.036\times 10^{12}$$1.036×1012 Degrees of Freedom per second at the finest level, which is No. 1 on the HPGMG list of Nov 2017.

[1]  P. Sadayappan,et al.  High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.

[2]  Weiguo Liu,et al.  Redesigning CAM-SE for Peta-Scale Climate Modeling Performance and Ultra-High Resolution on Sunway TaihuLight , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Naoya Maruyama,et al.  Optimizing Stencil Computations for NVIDIA Kepler GPUs , 2014 .

[4]  Samuel Williams,et al.  The potential of the cell processor for scientific computing , 2005, CF '06.

[5]  Samuel Williams,et al.  Compiler generation and autotuning of communication-avoiding operators for geometric multigrid , 2013, 20th Annual International Conference on High Performance Computing.

[6]  Daniel Ritter,et al.  A Geometric Multigrid Solver on GPU Clusters , 2013 .

[7]  Ninghui Sun,et al.  Fast implementation of DGEMM on Fermi GPU , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[8]  Peter Kilpatrick,et al.  A parallel pattern for iterative stencil + reduce , 2016, The Journal of Supercomputing.

[9]  Samuel Williams,et al.  Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[10]  Samuel Williams,et al.  Converting Stencils to Accumulations Forcommunication-Avoiding Optimizationin Geometric Multigrid , 2014 .

[11]  Li Kenli,et al.  Implementing Molecular Dynamics Simulation on Sunway TaihuLight System , 2016 .

[12]  Peng Zhang,et al.  Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor , 2017, 2017 46th International Conference on Parallel Processing (ICPP).

[13]  Samuel Williams,et al.  Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers , 2017, Parallel Comput..

[14]  Samuel Williams,et al.  Auto-Tuning Stencil Computations on Multicore and Accelerators , 2010, Scientific Computing with Multicore and Accelerators.

[15]  Chao Yang,et al.  26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[16]  Xin Liu,et al.  A Highly Effective Global Surface Wave Numerical Simulation with Ultra-High Resolution , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[17]  Guoping Long,et al.  Highly Optimized Code Generation for Stencil Codes with Computation Reuse for GPUs , 2016, Journal of Computer Science and Technology.

[18]  Ulrich Rüde,et al.  A Geometric Multigrid Solver on Tsubame 2.0 , 2011, Efficient Algorithms for Global Optimization Methods in Computer Vision.

[19]  Pradeep Dubey,et al.  3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[20]  DongarraJack,et al.  High-performance conjugate-gradient benchmark , 2016 .

[21]  Jack J. Dongarra,et al.  The LINPACK Benchmark: past, present and future , 2003, Concurr. Comput. Pract. Exp..

[22]  Jack J. Dongarra,et al.  High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems , 2016, Int. J. High Perform. Comput. Appl..

[23]  John Shalf,et al.  HPGMG 1.0: A Benchmark for Ranking High Performance Computing Systems , 2014 .

[24]  Peng Zhang,et al.  Performance Evaluation of HPGMG on Tianhe-2: Early Experience , 2015, ICA3PP.

[25]  Wei Cao,et al.  CPU/GPU computing for a multi-block structured grid based high-order flow solver on a large heterogeneous system , 2013, Cluster Computing.

[26]  Wei Ge,et al.  The Sunway TaihuLight supercomputer: system and applications , 2016, Science China Information Sciences.

[27]  Samuel Williams,et al.  Optimization of geometric multigrid for emerging multi- and manycore processors , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[28]  Frank Mueller,et al.  Auto-generation and auto-tuning of 3D stencil codes on GPU clusters , 2012, CGO '12.

[29]  Xu Ping,et al.  10M-Core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics , 2016 .

[30]  JaeHyuk Kwack,et al.  HPCG and HPGMG benchmark tests on multiple program, multiple data (MPMD) mode on Blue Waters—A Cray XE6/XK7 hybrid system , 2018, Concurr. Comput. Pract. Exp..

[31]  J. Ramanujam,et al.  A framework for enhancing data reuse via associative reordering , 2014, PLDI.

[32]  Chi Xue-bin,et al.  Extreme-Scale Phase Field Simulations of Coarsening Dynamics on the Sunway TaihuLight Supercomputer , 2016 .

[33]  Weiguo Liu,et al.  18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[34]  Sergei Gorlatch,et al.  High performance stencil code generation with Lift , 2018, CGO.