Scalable communication for high-order stencil computations using CUDA-aware MPI

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, because arithmetic performance has been observed to increase at a faster rate than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can deliver several times the throughput of central processing units in data-parallel tasks. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving the intra-node locality of workloads. In comparison to a theoretical performance model, our implementation exhibits strong scaling from one to 64 devices at 50%–87% efficiency in sixth-order stencil computations when the problem domain consists of 256³–1024³ cells.
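To make the numerical building blocks named above concrete, the following sketch illustrates a sixth-order central finite-difference stencil and a third-order low-storage Runge-Kutta step (Williamson's 2N-register scheme) in plain NumPy. This is an illustration of the method classes only, not the authors' GPU kernels; the function names and the periodic 1-D setting are assumptions for the example.

```python
import numpy as np

# Standard sixth-order central-difference coefficients for the first
# derivative on a uniform grid, for offsets j = -3..+3.
C6 = np.array([-1.0, 9.0, -45.0, 0.0, 45.0, -9.0, 1.0]) / 60.0

def ddx6(f, dx):
    """Sixth-order first derivative of f with periodic boundaries."""
    df = np.zeros_like(f)
    for k, c in enumerate(C6):
        # f[i + j] with j = k - 3 corresponds to np.roll(f, 3 - k)[i]
        df += c * np.roll(f, 3 - k)
    return df / dx

# Williamson (1980) low-storage RK3: two registers (u and w) suffice,
# which is why such schemes are popular on memory-constrained GPUs.
ALPHA = (0.0, -5.0 / 9.0, -153.0 / 128.0)
BETA = (1.0 / 3.0, 15.0 / 16.0, 8.0 / 15.0)

def rk3_step(u, rhs, dt):
    """Advance u by one step of du/dt = rhs(u) using low-storage RK3."""
    w = np.zeros_like(u)
    for a, b in zip(ALPHA, BETA):
        w = a * w + dt * rhs(u)
        u = u + b * w
    return u
```

For a quick sanity check, `ddx6` applied to sin(x) on a periodic grid reproduces cos(x) to roughly O(dx⁶), and `rk3_step` integrates du/dt = −u with third-order global accuracy.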
