Automatic Code Generation and Optimization of Large-scale Stencil Computation on Many-core Processors

Stencil computation is an indispensable building block of many scientific applications and is widely used by the numerical solvers of partial differential equations (PDEs). Due to the complex computation patterns of different stencils and the various hardware targets (e.g., many-core processors), many domain-specific languages (DSLs) have been proposed to optimize stencil computation. However, existing stencil DSLs mostly focus on the performance optimizations on homogeneous many-core processors such as CPUs and GPUs, and fail to embrace emerging heterogeneous many-core processors such as Sunway. In addition, few of them can support expressing stencil with multiple time dependencies and optimizations from both spatial and temporal dimensions. Moreover, most stencil DSLs are unable to generate codes that can run efficiently in large scale, which limits their practical applicability. In this paper, we propose MSC, a new stencil DSL designed to express stencil computation in both spatial and temporal dimensions. It can generate high-performance stencil codes for large-scale execution on emerging many-core processors. Specially, we design several optimization primitives for improving parallelism and data locality, and a communication library for efficient halo exchange in large scale execution. The experiment results show that our MSC achieves better performance compared to the state-of-the-art stencil DSLs.

[1]  Shoaib Kamil,et al.  Distributed Halide , 2016, PPoPP.

[2]  Wei Ge,et al.  The Sunway TaihuLight supercomputer: system and applications , 2016, Science China Information Sciences.

[3]  Pradeep Dubey,et al.  3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Sergei Gorlatch,et al.  High performance stencil code generation with Lift , 2018, CGO.

[5]  Michael E. Wolf,et al.  Combining Loop Transformations Considering Caches and Scheduling , 2004, International Journal of Parallel Programming.

[6]  Guangwen Yang,et al.  swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[7]  Tobias Gysi,et al.  STELLA: a domain-specific tool for structured grid methods in weather and climate models , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[8]  Helmar Burkhart,et al.  PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[9]  Satoshi Matsuoka,et al.  Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[10]  Shoaib Kamil,et al.  ParSy: Inspection and Transformation of Sparse Matrix Computations for Parallelism , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.

[11]  Uday Bondhugula,et al.  A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.

[12]  P. Sadayappan,et al.  High-performance code generation for stencil computations on GPU architectures , 2012, ICS '12.

[13]  V. Natoli,et al.  Exploring New Architectures in Accelerating CFD for Air Force Applications , 2008, 2008 DoD HPCMP Users Group Conference.

[14]  David E. Keyes,et al.  Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates , 2014, SIAM J. Sci. Comput..

[15]  Albert Cohen,et al.  Hybrid Hexagonal/Classical Tiling for GPUs , 2014, CGO '14.

[16]  Guangwen Yang,et al.  Massively Scaling Seismic Processing on Sunway TaihuLight Supercomputer , 2020, IEEE Transactions on Parallel and Distributed Systems.

[17]  P. Sadayappan,et al.  On Optimizing Complex Stencils on GPUs , 2019, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[18]  Weiguo Liu,et al.  18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[19]  Frédo Durand,et al.  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI.

[20]  Alejandro Duran,et al.  YASK—Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning , 2016, 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC).

[21]  Chao Yang,et al.  26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[22]  Hongbin Zheng,et al.  Polly – Polyhedral optimization in LLVM , 2012 .

[23]  Depei Qian,et al.  Performance Evaluation and Analysis of Linear Algebra Kernels in the Prototype Tianhe-3 Cluster , 2019, SCFA.

[24]  Chau-Wen Tseng,et al.  Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[25]  Shoaib Kamil,et al.  Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code , 2018, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[26]  Chao Yang,et al.  10M-Core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[27]  Frédo Durand,et al.  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.

[28]  Bradford L. Chamberlain,et al.  Parameterized Diamond Tiling for Stencil Computations with Chapel parallel iterators , 2015, ICS.

[29]  Gihan R. Mudalige,et al.  Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS , 2017, IEEE Transactions on Parallel and Distributed Systems.

[30]  Volker Strumpen,et al.  Cache oblivious stencil computations , 2005, ICS '05.

[31]  Jia Guo,et al.  Writing productive stencil codes with overlapped tiling , 2009 .

[32]  Shoaib Kamil,et al.  OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[33]  Guangwen Yang,et al.  Optimizing high-resolution Community Earth System Model on a heterogeneous many-core supercomputing platform , 2020 .

[34]  Richard Veras,et al.  A stencil compiler for short-vector SIMD architectures , 2013, ICS '13.

[35]  Mohamed Wahib,et al.  AN5D: automated stencil framework for high-degree temporal blocking on GPUs , 2020, CGO.

[36]  Allen Taflove,et al.  Computational Electrodynamics the Finite-Difference Time-Domain Method , 1995 .

[37]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[38]  Charles Yount,et al.  Architecture and Performance of Devito, a System for Automated Stencil Computation , 2018, ACM Trans. Math. Softw..

[39]  Andreas Klöckner,et al.  Loo.py: transformation-based code generation for GPUs and CPUs , 2014, ARRAY@PLDI.

[40]  Guangwen Yang,et al.  Accelerating Sparse Cholesky Factorization on Sunway Manycore Architecture , 2020, IEEE Transactions on Parallel and Distributed Systems.