Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators

Abstract

EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG consists of the multidimensional positive definite advection transport algorithm (MPDATA) and an elliptic solver. In this work we investigate how to construct an efficient parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components. Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. To better utilize the features of such computing platforms, we propose comprehensive adaptations of MPDATA computations to hybrid architectures. These adaptations rest on strategies for managing memory and computing resources that relieve memory and communication bottlenecks and allow a larger fraction of the theoretical floating-point performance of CPU–GPU platforms to be exploited.

The main contributions of the paper are:

• a method for decomposing the 2D MPDATA algorithm that adapts MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
• a method for adapting 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques (a CPU-side sketch is given after this abstract);
• a method for adapting the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, supported by the developed GPU task scheduler that allows flexible management of the available resources;
• an approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which yields an implementation methodology that is portable across a variety of GPUs.

The hybrid platforms tested in this study contain different numbers of CPUs and GPUs, from a single CPU paired with a single GPU to the most elaborate configuration with two CPUs and two GPUs. Processors from different vendors are employed in these systems: both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all grid sizes and all tested platforms, the hybrid version with computations spread across CPU and GPU components achieves the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid version over the GPU-only and CPU-only versions range from 1.30 to 1.69 and from 1.95 to 2.25, respectively.
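To make the CPU-side adaptation concrete, the sketch below shows how a single 2D upwind (donor-cell) pass, the kind of stencil that the first stage of MPDATA is built on, can be parallelized with OpenMP and space blocking. It is a minimal illustration under stated assumptions, not the authors' implementation: the grid and block sizes, the face-velocity convention, and the helper names (flux, upwind_step) are introduced only for this example.

```c
/* Minimal sketch (not the authors' code): an OpenMP-parallel, space-blocked
 * donor-cell (upwind) pass of 2D advection, i.e. the kind of stencil the
 * first stage of MPDATA is built on. Grid and block sizes, the velocity
 * staggering, and the flux form are illustrative assumptions; velocities
 * are assumed pre-scaled by dt/dx (local Courant numbers). */
#include <math.h>
#include <stdlib.h>

#define NX 2048
#define NY 2048
#define BX 64   /* cache-sized tiles; in practice candidates for autotuning */
#define BY 64

static inline double flux(double psi_l, double psi_r, double u)
{
    /* Donor-cell flux: pick the upwind value according to the sign of u. */
    return 0.5 * ((u + fabs(u)) * psi_l + (u - fabs(u)) * psi_r);
}

static void upwind_step(const double *psi, double *psi_new,
                        const double *u, const double *v)
{
    /* Space blocking: threads are assigned whole tiles, which keeps each
     * thread's working set small enough to stay in cache. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int jb = 1; jb < NY - 1; jb += BY)
        for (int ib = 1; ib < NX - 1; ib += BX) {
            int jend = jb + BY < NY - 1 ? jb + BY : NY - 1;
            int iend = ib + BX < NX - 1 ? ib + BX : NX - 1;
            for (int j = jb; j < jend; ++j)
                for (int i = ib; i < iend; ++i) {
                    int c = j * NX + i;
                    /* u[c] / v[c] are taken as velocities on the right/upper
                     * cell faces (an assumed staggering for this sketch). */
                    psi_new[c] = psi[c]
                        - (flux(psi[c],      psi[c + 1],  u[c])
                         - flux(psi[c - 1],  psi[c],      u[c - 1]))
                        - (flux(psi[c],      psi[c + NX], v[c])
                         - flux(psi[c - NX], psi[c],      v[c - NX]));
                }
        }
}

int main(void)
{
    double *psi = calloc(NX * NY, sizeof *psi);
    double *out = calloc(NX * NY, sizeof *out);
    double *u   = calloc(NX * NY, sizeof *u);
    double *v   = calloc(NX * NY, sizeof *v);
    upwind_step(psi, out, u, v);  /* one advection step on a zero field */
    free(psi); free(out); free(u); free(v);
    return 0;
}
```

Temporal blocking would additionally apply several such passes to one tile (with suitably grown halos) before moving on, and on the GPU side the same tile decomposition maps naturally onto OpenCL work-groups whose sizes are left to the autotuner.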
