Design and Optimization of OpenFOAM-based CFD Applications for Modern Hybrid and Heterogeneous HPC Platforms

Amani AlOnazi

High performance computing platforms are advancing rapidly, and most simulations carried out on them improve on one level yet expose shortcomings in the capabilities of current CFD packages. Hardware-aware design and optimization are therefore crucial to exploiting modern computing resources. This thesis proposes optimizations aimed at accelerating numerical simulations, illustrated in OpenFOAM solvers. A hybrid MPI and GPGPU parallel conjugate gradient linear solver has been designed and implemented to solve the sparse linear algebraic kernel that derives from two CFD solvers: icoFoam, an incompressible flow solver, and laplacianFoam, which solves the Poisson equation, e.g., for thermal diffusion. A load-balancing step is applied using heterogeneous decomposition, which distributes the computations according to the performance of each computing device while seeking to minimize communication. In addition, we implemented the recently developed pipelined conjugate gradient method as an algorithmic improvement, and parallelized it using MPI, GPGPU, and a hybrid technique. While many questions about the ultimately attainable per-node performance and multi-node scaling remain, the experimental results show that the hybrid implementation of both solvers significantly outperforms state-of-the-art implementations of a widely used open source package.
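For readers unfamiliar with the linear kernel mentioned above, the following is a minimal serial sketch of the textbook (unpreconditioned) conjugate gradient iteration, applied to a 1D Poisson matrix. It is not the thesis implementation: the function names (`matvec_poisson_1d`, `dot`, `conjugate_gradient`), the tridiagonal test matrix, and the tolerance are all assumptions chosen for illustration, and there is no MPI or GPU code here.

```python
def matvec_poisson_1d(x):
    """Apply the 1D Poisson matrix tridiag(-1, 2, -1) to x without storing it.

    Hypothetical stand-in for the sparse matrix-vector product of a CFD solver.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(a, b):
    """Inner product; in a distributed solver this would be a global reduction."""
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only via matvec.

    Returns the approximate solution and the number of iterations performed.
    """
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for iteration in range(max_iter):
        if rs_old ** 0.5 < tol:
            return x, iteration
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)      # first reduction of the iteration
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)               # second reduction of the iteration
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


if __name__ == "__main__":
    n = 50
    solution, iterations = conjugate_gradient(matvec_poisson_1d, [1.0] * n)
    print(iterations, solution[:3])
```

Each iteration needs one matrix-vector product and two inner products. In a distributed-memory setting the inner products become global reductions, which is the synchronization cost that pipelined variants of the method aim to overlap with other work.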
