Paper information - ParCFD 2014 Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Hardware-aware design and optimization are crucial to exploiting emerging architectures for PDE-based computational fluid dynamics applications. In this work, we study optimizations that accelerate OpenFOAM-based applications on emerging hybrid heterogeneous platforms. OpenFOAM uses MPI for multi-processor parallelism, which scales well on homogeneous systems but does not fully exploit the per-node performance available on hybrid heterogeneous platforms. Our study uses two OpenFOAM applications, icoFoam and laplacianFoam, both based on Krylov iterative methods. We propose several optimizations of the dominant kernel of the Krylov solver to accelerate overall application execution on modern GPU-accelerated heterogeneous platforms. Experimental results show that the proposed hybrid implementation significantly outperforms the state-of-the-art implementation.
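
As a rough illustration of what a Krylov solver's per-iteration work looks like, the sketch below implements a plain conjugate gradient loop over a matrix in CSR storage. The abstract names neither the specific Krylov method nor the dominant kernel, so both choices here are assumptions: conjugate gradient stands in for the solver, and the sparse matrix-vector product (`csr_matvec`) stands in for the kernel. All function names and the 1D Laplacian test matrix are hypothetical and are not taken from OpenFOAM or from the paper.

```python
# Illustrative sketch only: conjugate gradient (one Krylov method) in pure Python.
# Assumption: the sparse matrix-vector product is treated as the dominant kernel;
# the abstract does not state which kernel the authors optimize.


def csr_matvec(values, col_idx, row_ptr, x):
    """Return y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(values, col_idx, row_ptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A; return (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    iterations = 0
    while iterations < max_iter and rs_old ** 0.5 > tol:
        Ap = csr_matvec(values, col_idx, row_ptr, p)  # one matvec per iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
        iterations += 1
    return x, iterations


def laplacian_1d_csr(n):
    """CSR arrays for the 1D stencil [-1, 2, -1] (hypothetical test matrix)."""
    values, col_idx, row_ptr = [], [], [0]
    for i in range(n):
        if i > 0:
            values.append(-1.0)
            col_idx.append(i - 1)
        values.append(2.0)
        col_idx.append(i)
        if i < n - 1:
            values.append(-1.0)
            col_idx.append(i + 1)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


if __name__ == "__main__":
    vals, cols, ptr = laplacian_1d_csr(8)
    rhs = csr_matvec(vals, cols, ptr, [1.0] * 8)   # right-hand side for x = ones
    solution, iters = conjugate_gradient(vals, cols, ptr, rhs)
    print(iters, solution)
```

Each iteration performs one matrix-vector product, two inner products, and three vector updates. In a hybrid setting of the kind the abstract describes, the matrix-vector product is the natural candidate for offloading to an accelerator, though the sketch above makes no claim about how the authors actually partition the work.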
