Paper information - Large-scale performance of a DSL-based multi-block structured-mesh application for Direct Numerical Simulation

Abstract: SBLI (Shock-wave/Boundary-layer Interaction) is a large-scale Computational Fluid Dynamics (CFD) application, developed over 20 years at the University of Southampton and extensively used within the UK Turbulence Consortium. It is capable of performing Direct Numerical Simulation (DNS) or Large Eddy Simulation (LES) of shock-wave/boundary-layer interaction problems over highly detailed multi-block structured mesh geometries. SBLI presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging massively parallel hardware platforms. In this paper we present research towards achieving this goal through the OPS embedded domain-specific language. OPS targets the domain of multi-block structured mesh applications. It provides an API embedded in C/C++ and Fortran and makes use of automatic code generation and compilation to produce executables capable of running on a range of parallel hardware systems. The core functionality of SBLI is captured using a new framework called OpenSBLI, which enables a developer to declare the partial differential equations using Einstein notation and then automatically carry out discretization and generation of OPS (C/C++) API code. OPS is then used to automatically generate a wide range of parallel implementations. Using this multi-layered abstraction approach, we demonstrate how new opportunities for further optimization can be gained, such as fine-tuning the computational intensity and reducing data movement, and how these optimizations can be applied automatically. Performance results demonstrate that there is no performance loss due to the high-level development strategy with OPS and OpenSBLI, with performance matching or exceeding the hand-tuned original code on all CPU nodes tested. The data movement optimizations provide over 3× speedups on CPU nodes, while GPUs provide 5× speedups over the best performing CPU node. The OPS-generated parallel code also demonstrates excellent scalability on nearly 100K cores on a Cray XC30 (ARCHER at EPCC) and on over 4K GPUs on a Cray XK7 (Titan at ORNL).
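
As a rough illustration of what "declaring an equation in Einstein notation and then discretizing it" involves, the sketch below expands the repeated index in a flux-divergence term, d(rho u_j)/dx_j, into a sum over the spatial directions and replaces each derivative with a finite difference. This is a hand-written, pure-Python sketch and not OpenSBLI's API or OPS-generated code. The function names, the second-order central-difference scheme, the periodic 2D grid, and the test field are all assumptions made for illustration; the abstract does not state which schemes the paper uses.

```python
import math

def central_diff(f, axis, h):
    """Second-order central difference of a 2D field on a periodic uniform grid.
    (Assumed scheme, chosen only for illustration.)"""
    n0, n1 = len(f), len(f[0])
    out = [[0.0] * n1 for _ in range(n0)]
    for i in range(n0):
        for k in range(n1):
            if axis == 0:
                out[i][k] = (f[(i + 1) % n0][k] - f[(i - 1) % n0][k]) / (2.0 * h)
            else:
                out[i][k] = (f[i][(k + 1) % n1] - f[i][(k - 1) % n1]) / (2.0 * h)
    return out

def flux_divergence(rho, u, h):
    """Expand d(rho*u_j)/dx_j: the repeated index j implies a sum over axes."""
    n0, n1 = len(rho), len(rho[0])
    div = [[0.0] * n1 for _ in range(n0)]
    for j in range(2):  # implied summation over j = 0, 1
        flux = [[rho[i][k] * u[j][i][k] for k in range(n1)] for i in range(n0)]
        d = central_diff(flux, j, h)
        for i in range(n0):
            for k in range(n1):
                div[i][k] += d[i][k]
    return div

# Hypothetical test field on [0, 2*pi)^2: rho = 1, u = (sin x, sin y),
# so the exact divergence is cos x + cos y.
n = 64
h = 2.0 * math.pi / n
xs = [i * h for i in range(n)]
rho = [[1.0] * n for _ in range(n)]
u = [
    [[math.sin(xs[i]) for k in range(n)] for i in range(n)],
    [[math.sin(xs[k]) for k in range(n)] for i in range(n)],
]

div = flux_divergence(rho, u, h)
max_err = max(
    abs(div[i][k] - (math.cos(xs[i]) + math.cos(xs[k])))
    for i in range(n)
    for k in range(n)
)
```

Each point update in `central_diff` reads only its immediate neighbours, which is the kind of stencil access pattern over a structured mesh that the abstract says OPS targets.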
