Task‐based parallel strategies for computational fluid dynamic application in heterogeneous CPU/GPU resources

Parallel applications running on contemporary heterogeneous clusters are complex to code and optimize. The task-based programming model is an alternative that helps manage this coding complexity. In this model, the problem domain is split into tasks whose dependencies form a directed acyclic graph, and the set of tasks is submitted to a runtime scheduler that dynamically maps each task to the available resources. Computational fluid dynamics (CFD) applications are typical in scientific computing, yet they remain insufficiently explored by designs that employ the task-based programming model. This article presents task-based parallel strategies for a simple CFD application that targets heterogeneous multi-CPU/multi-GPU computing resources. We design, develop, evaluate, and compare the performance of three parallel strategies (naive, ghost-cells, and arrow) of a task-based heterogeneous (CPU and GPU) application that simulates the flow of an incompressible Newtonian fluid with constant viscosity. All implementations rely on the StarPU runtime, and we use the StarVZ toolkit to conduct a comprehensive performance analysis. Results indicate that the ghost-cells strategy provides the best speedup (77×) in simulation time while the GPU resources still have available memory, whereas the arrow strategy achieves better results as the simulation data grows.
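
To make the task-based model described above concrete, the minimal C sketch below illustrates how a block-decomposed update could be expressed on top of the StarPU runtime that the article uses. It is not the authors' implementation: the block count, block size, and the `update_block_cpu` placeholder kernel are illustrative assumptions; only the StarPU calls themselves (`starpu_matrix_data_register`, `starpu_task_insert`, `starpu_task_wait_for_all`) reflect the real API. StarPU derives the dependency graph from the declared data accesses and dispatches each task to a CPU core or a GPU at runtime.

```c
/* Minimal sketch (not the paper's code) of the task-based model: the domain
 * is split into blocks, each block is registered as a StarPU data handle,
 * and one task per block per iteration is submitted. StarPU builds the DAG
 * from the declared accesses and schedules tasks on the available workers. */
#include <starpu.h>
#include <stdint.h>
#include <stdlib.h>

#define NBLOCKS 4      /* illustrative block count */
#define BLOCK   256    /* illustrative block edge size */

/* CPU implementation of one block update (placeholder computation). */
static void update_block_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    float *sub    = (float *)STARPU_MATRIX_GET_PTR(buffers[0]);
    unsigned nx   = STARPU_MATRIX_GET_NX(buffers[0]);
    unsigned ny   = STARPU_MATRIX_GET_NY(buffers[0]);
    unsigned ld   = STARPU_MATRIX_GET_LD(buffers[0]);

    for (unsigned j = 0; j < ny; j++)
        for (unsigned i = 0; i < nx; i++)
            sub[j * ld + i] += 1.0f;   /* stand-in for the real kernel */
}

/* A codelet groups the implementations of a task; a heterogeneous version
 * would also set .cuda_funcs so the scheduler can pick CPU or GPU. */
static struct starpu_codelet update_cl = {
    .cpu_funcs = { update_block_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    if (starpu_init(NULL) != 0)
        return EXIT_FAILURE;

    float *blocks[NBLOCKS];
    starpu_data_handle_t handles[NBLOCKS];

    /* Register each block so StarPU can track it and move it between memories. */
    for (int b = 0; b < NBLOCKS; b++) {
        blocks[b] = calloc(BLOCK * BLOCK, sizeof(float));
        starpu_matrix_data_register(&handles[b], STARPU_MAIN_RAM,
                                    (uintptr_t)blocks[b],
                                    BLOCK, BLOCK, BLOCK, sizeof(float));
    }

    /* Submit tasks asynchronously; successive RW accesses to the same handle
     * create the edges of the DAG that the scheduler respects. */
    for (int iter = 0; iter < 10; iter++)
        for (int b = 0; b < NBLOCKS; b++)
            starpu_task_insert(&update_cl, STARPU_RW, handles[b], 0);

    starpu_task_wait_for_all();

    for (int b = 0; b < NBLOCKS; b++) {
        starpu_data_unregister(handles[b]);
        free(blocks[b]);
    }
    starpu_shutdown();
    return EXIT_SUCCESS;
}
```

In the article's actual strategies, each codelet would also provide a CUDA implementation so tasks can run on either resource, and the naive, ghost-cells, and arrow variants presumably differ in how boundary data between neighboring blocks is registered and exchanged; those details are not covered by the abstract.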
