Some useful optimisations for unstructured computational fluid dynamics codes on multicore and manycore architectures

Abstract
This paper presents a number of optimisations for improving the performance of unstructured computational fluid dynamics codes on multicore and manycore architectures such as the Intel Sandy Bridge, Broadwell and Skylake CPUs and the Intel Xeon Phi Knights Corner and Knights Landing manycore processors. We discuss and demonstrate their implementation in two distinct classes of computational kernels: face-based loops, represented by the computation of fluxes, and cell-based loops, represented by updates to state vectors. We show the importance of making efficient use of the underlying vector units in both classes of kernels, with special emphasis on the changes required for vectorising face-based loops with their inherently indirect and irregular access patterns. We demonstrate the advantage of different data layouts for cell-centred as well as face data structures, and of architecture-specific optimisations for improving the performance of the gather and scatter operations that are prevalent in unstructured mesh applications. The implementation of a software prefetching strategy based on auto-tuning is also shown, along with an empirical evaluation of the importance of multithreading for in-order architectures such as Knights Corner. We explore the various memory modes available on the Intel Xeon Phi Knights Landing architecture and present an approach whereby both the traditional DRAM and the MCDRAM interfaces are exploited for maximum performance. We obtain full application speed-ups of between 2.8X and 3X across the multicore CPUs in two-socket node configurations, 8.6X on the Intel Xeon Phi Knights Corner coprocessor and 5.6X on the Intel Xeon Phi Knights Landing processor in an unstructured finite volume CFD code representative in size and complexity of an industrial application.
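The two kernel classes named in the abstract can be illustrated with a minimal C++ sketch. It is not taken from the paper's code: the names `CellData`, `accumulate_fluxes` and `update_state`, the structure-of-arrays layout and the placeholder flux formula are all assumptions chosen for illustration. The face-based loop reads and writes cell data through the indirect indices `left[f]` and `right[f]`, which is where the gather and scatter operations arise, while the cell-based loop is unit-stride.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays storage for one cell-centred scalar.
struct CellData {
    std::vector<double> q;    // state variable, one entry per cell
    std::vector<double> res;  // accumulated residual, one entry per cell
};

// Face-based loop: every face gathers the states of its two adjacent cells,
// evaluates a flux and scatters it to both residuals.
void accumulate_fluxes(const std::vector<int>& left,
                       const std::vector<int>& right,
                       CellData& cells) {
    assert(left.size() == right.size());
    for (std::size_t f = 0; f < left.size(); ++f) {
        const int l = left[f];   // indirect index: gather
        const int r = right[f];  // indirect index: gather
        // Placeholder flux for illustration only, not a real flux scheme.
        const double flux = 0.5 * (cells.q[r] - cells.q[l]);
        cells.res[l] += flux;    // scatter; two faces may share a cell
        cells.res[r] -= flux;
    }
}

// Cell-based loop: contiguous accesses that a compiler can vectorise directly.
void update_state(CellData& cells, double dt) {
    for (std::size_t c = 0; c < cells.q.size(); ++c) {
        cells.q[c] += dt * cells.res[c];
        cells.res[c] = 0.0;  // reset for the next iteration
    }
}
```

Because two faces can update the same cell, the scatter in `accumulate_fluxes` carries a potential write conflict between iterations. This is why a face-based loop generally cannot be vectorised as written, whereas `update_state` can.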
Program summary
Program Title: some_opt_for_unstructured_cfd
Program Files doi: http://dx.doi.org/10.17632/zyh2zkf3jw.1
Licensing provisions: GNU General Public License 3 (GPL)
Programming language: C/C++
Nature of problem: The solution of fluid flow problems in the vicinity of complex geometries mandates the utilisation of unstructured grids. However, the flexibility of unstructured mesh methods in dealing with complicated geometries comes at the cost of increased difficulty in extracting high performance from modern processors. We provide implementations of a number of optimisations useful for improving the performance of unstructured CFD codes on modern multicore and manycore architectures.
Solution method: grid renumbering via Reverse Cuthill–McKee, code transformations necessary for enabling vectorisation, face colouring/reordering for removing dependencies at the face end-points when accumulating residuals, data layout transformations for reducing cache misses, hand-tuned gather and scatter primitives for in-register transpositions, software prefetching via auto-tuning, and multithreading for exploiting the SMT features of modern processors.
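
As an illustration of the face colouring step listed under the solution method, the sketch below assigns each face the smallest colour not already used by a face touching either of its end-point cells. It is a generic greedy colouring written for this summary, not the implementation distributed with the program. The function name `colour_faces`, the `left`/`right` face-to-cell arrays and the limit of 64 colours (one bit per colour in a 64-bit mask) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Greedy face colouring (illustrative): faces with the same colour never
// share an end-point cell, so their residual updates cannot conflict.
std::vector<int> colour_faces(const std::vector<int>& left,
                              const std::vector<int>& right,
                              std::size_t num_cells) {
    assert(left.size() == right.size());
    // used[c] has bit k set when a face of colour k already touches cell c.
    std::vector<std::uint64_t> used(num_cells, 0);
    std::vector<int> colour(left.size(), -1);
    for (std::size_t f = 0; f < left.size(); ++f) {
        const std::uint64_t taken = used[left[f]] | used[right[f]];
        int c = 0;
        while (c < 64 && ((taken >> c) & 1u)) {
            ++c;  // skip colours already present at either end-point
        }
        assert(c < 64);  // assumption of this sketch: at most 64 colours
        colour[f] = c;
        used[left[f]] |= std::uint64_t{1} << c;
        used[right[f]] |= std::uint64_t{1} << c;
    }
    return colour;
}
```

Sorting the faces by the returned colour groups conflict-free faces together. Faces within one group can then be processed in the same vector register or by different threads without two lanes writing to the same cell residual. The price is reduced spatial locality between consecutive faces.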
