Performance modeling and optimization of parallel LU-SGS on many-core processors for 3D high-order CFD simulations

As a typical Gauss–Seidel method, lower-upper symmetric Gauss–Seidel (LU-SGS) has an inherently strong data dependency that makes shared-memory parallelization difficult. On early multi-core processors, the pipelined parallel LU-SGS approach achieves promising scalability. However, on emerging many-core processors such as Xeon Phi, experience from our in-house high-order CFD program shows that the parallel efficiency drops dramatically to less than 25%. In this paper, we model and analyze the performance of the pipelined parallel LU-SGS algorithm and present a two-level pipeline (TL-Pipeline) approach that uses nested OpenMP to exploit fine-grained parallelism and mitigate the parallel performance bottlenecks. Our TL-Pipeline approach achieves a 20% performance gain for a regular problem ($256\times 256\times 256$) on Xeon Phi. We also discuss practical problems for realistic CFD simulations, including domain decomposition and algorithm parameter tuning. More generally, our work is applicable to the shared-memory parallelization of all Gauss–Seidel-like methods with intrinsically strong data dependency.
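To make the data dependency and the pipelining idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a simplified forward sweep in which each cell depends on its already-updated neighbors at `i-1`, `j-1`, and `k-1`; the stencil, the coefficient `0.25`, the choice of partitioning the `j` axis among threads, and all function names are illustrative assumptions. The pipeline is simulated serially: in each step, hypothetical thread `t` handles plane `k = step - t` of its own `j` chunk, and the threads within a step are visited in reverse order to show that they do not depend on each other.

```python
import random


def make_rhs(n, seed=0):
    """Build a hypothetical n x n x n right-hand side, indexed as b[k][j][i]."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(n)] for _ in range(n)] for _ in range(n)]


def update(x, b, i, j, k):
    """Gauss-Seidel-like update using already-updated lower neighbors (assumed stencil)."""
    west = x[k][j][i - 1] if i > 0 else 0.0
    south = x[k][j - 1][i] if j > 0 else 0.0
    below = x[k - 1][j][i] if k > 0 else 0.0
    return b[k][j][i] + 0.25 * (west + south + below)


def sequential_sweep(b, n):
    """Reference forward sweep in plain lexicographic order."""
    x = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for k in range(n):
        for j in range(n):
            for i in range(n):
                x[k][j][i] = update(x, b, i, j, k)
    return x


def pipelined_sweep(b, n, num_threads):
    """Serial simulation of a one-level pipeline over k-planes.

    Thread t owns a contiguous chunk of j and starts plane k one step
    after thread t-1 has finished the same plane.
    """
    x = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    chunk = -(-n // num_threads)  # ceiling division
    for step in range(n + num_threads - 1):
        # Work items inside one step are mutually independent; reversed
        # order is used only to demonstrate that independence.
        for t in reversed(range(num_threads)):
            k = step - t
            if not 0 <= k < n:
                continue  # thread t is idle during pipeline fill or drain
            for j in range(t * chunk, min((t + 1) * chunk, n)):
                for i in range(n):
                    x[k][j][i] = update(x, b, i, j, k)
    return x


if __name__ == "__main__":
    n, threads = 8, 4
    b = make_rhs(n)
    same = pipelined_sweep(b, n, threads) == sequential_sweep(b, n)
    print("pipelined result matches sequential sweep:", same)
    print("pipeline steps:", n + threads - 1, "for", n, "planes")
```

In this sketch the sweep takes `n + num_threads - 1` steps instead of `n`, so the fill and drain phases leave threads idle for a growing share of the time as the thread count increases relative to the number of planes. This is one plausible contributor to the efficiency loss on many-core processors described above.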
