Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms (ParCFD 2014)

Hardware-aware design and optimization are crucial for exploiting emerging architectures in PDE-based computational fluid dynamics applications. In this work, we study optimizations that accelerate OpenFOAM-based applications on emerging hybrid heterogeneous platforms. OpenFOAM uses MPI for multi-processor parallelism, which scales well on homogeneous systems but does not fully exploit the per-node performance available on hybrid heterogeneous platforms. Our study uses two OpenFOAM applications, icoFoam and laplacianFoam, both of which rely on Krylov iterative methods. We propose several optimizations of the dominant kernel of the Krylov solver, aimed at accelerating the overall execution of these applications on modern GPU-accelerated heterogeneous platforms. Experimental results show that the proposed hybrid implementation significantly outperforms the state-of-the-art implementation.
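To make the role of a "dominant kernel" in a Krylov solver concrete, the following is a minimal sketch and not the implementation described in this work. It assumes that the kernel in question is the sparse matrix-vector product, which is typically the most expensive step of the conjugate gradient (CG) method; the abstract itself does not name the kernel. The sketch is CPU-only, unpreconditioned, and uses CSR storage; the function names (`csr_matvec`, `conjugate_gradient`, `laplacian_1d_csr`) are hypothetical and are not OpenFOAM APIs.

```python
def csr_matvec(indptr, indices, data, x):
    """Return y = A @ x for a sparse matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite CSR matrix.

    Returns the approximate solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # search direction
    rr = dot(r, r)
    for iteration in range(max_iter):
        if rr ** 0.5 < tol:
            return x, iteration
        # The sparse matrix-vector product: one call per iteration,
        # assumed here to be the dominant cost of the solver.
        Ap = csr_matvec(indptr, indices, data, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


def laplacian_1d_csr(n):
    """Build the n x n tridiagonal matrix with stencil (-1, 2, -1) in CSR form."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        if i > 0:
            indices.append(i - 1)
            data.append(-1.0)
        indices.append(i)
        data.append(2.0)
        if i < n - 1:
            indices.append(i + 1)
            data.append(-1.0)
        indptr.append(len(indices))
    return indptr, indices, data


if __name__ == "__main__":
    indptr, indices, data = laplacian_1d_csr(8)
    solution, iterations = conjugate_gradient(indptr, indices, data, [1.0] * 8)
    print(iterations, solution)
```

Each CG iteration performs one matrix-vector product plus a handful of vector updates and dot products, so under the stated assumption, offloading or restructuring `csr_matvec` is where a GPU-oriented optimization would be expected to pay off most.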
