Unstructured computational aerodynamics on many integrated core architecture

Abstract: Shared-memory parallelization of the flux kernel of PETSc-FUN3D, an unstructured tetrahedral mesh Euler flow code previously studied for distributed memory and multi-core shared memory, is evaluated on up to 61 cores per node and up to 4 threads per core. We explore several thread-level optimizations to improve flux kernel performance on the state-of-the-art Intel Xeon Phi “Knights Corner” many integrated core (MIC) processor, with a focus on strong thread scaling. The linear algebraic kernel is bottlenecked by memory bandwidth for even modest numbers of cores sharing a common memory. The flux kernel, by contrast, is compute-intensive and is known to exploit contemporary multi-core hardware effectively; it arises both in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing those residuals. We extend the study of flux kernel performance to the Xeon Phi in three thread affinity modes (scatter, compact, and balanced), in both offload and native mode, with and without various code optimizations to improve alignment and reduce cache coherency penalties. Relative to baseline “out-of-the-box” optimized compilation, code restructuring optimizations provide about 3.8x speedup in offload mode and about 5x speedup in native mode. Even with these gains for the flux kernel, the MIC merely achieves parity in execution time with optimized compilation on a contemporary multi-core Intel CPU, the 16-core Sandy Bridge E5 2670. Nevertheless, the optimizations employed to reduce the data motion and cache coherency protocol penalties of the MIC are expected to be of value for CFD and many other unstructured applications as many-core architecture evolves.
We also explore large-scale distributed-shared memory performance on the Cray XC40 supercomputer to demonstrate that the optimizations employed on the Phi carry over to this hybrid context, where each of thousands of nodes comprises two sockets of Intel Xeon Haswell CPUs, for 32 cores per node.
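To make the two roles of the flux kernel concrete, the following is a minimal serial sketch, not the PETSc-FUN3D implementation. It assumes a scalar unknown per node and a toy central flux; the names `flux`, `residual`, and `fd_jacobian` are hypothetical and chosen only for illustration. The edge loop accumulates conservation law residuals at the two endpoints of each edge, and the same residual routine is then finite-differenced to approximate Jacobian entries.

```python
def flux(qi, qj):
    """Toy central flux between two node states (an illustrative assumption)."""
    return 0.5 * (qi + qj)


def residual(edges, q):
    """Edge-based residual: each edge adds its flux to one endpoint and
    subtracts it from the other, so the contributions sum to zero."""
    r = [0.0] * len(q)
    for i, j in edges:
        f = flux(q[i], q[j])
        # In a threaded version, these two scattered updates are where
        # write conflicts between threads sharing a node would arise.
        r[i] += f
        r[j] -= f
    return r


def fd_jacobian(edges, q, eps=1e-6):
    """Approximate dR/dq column by column using forward differences
    of the residual routine."""
    n = len(q)
    r0 = residual(edges, q)
    jac = [[0.0] * n for _ in range(n)]
    for k in range(n):
        qp = list(q)
        qp[k] += eps  # perturb one unknown at a time
        rp = residual(edges, qp)
        for m in range(n):
            jac[m][k] = (rp[m] - r0[m]) / eps
    return jac


# A three-node "mesh" whose edges form a single triangle.
edges = [(0, 1), (1, 2), (2, 0)]
q = [1.0, 2.0, 4.0]
r = residual(edges, q)
jac = fd_jacobian(edges, q)
```

Because every residual evaluation revisits the full edge loop, the cost of building a finite-difference Jacobian is dominated by the flux kernel, which is one reason its thread-level performance matters.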
