Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators

We present an algorithmic framework for matrix-free evaluation of discontinuous Galerkin finite element operators. It relies on fast quadrature with sum factorization on quadrilateral and hexahedral meshes, targeting general weak forms of linear and nonlinear partial differential equations. Different algorithms and data structures are compared in an in-depth performance analysis. The implementations of the local integrals are optimized by vectorization over several cells and faces and an even-odd decomposition of the one-dimensional interpolations. Up to 60% of the arithmetic peak on Intel Haswell, Broadwell, and Knights Landing processors is reached when running from caches and up to 40% of peak when also considering the access to vectors from main memory. On 2×14 Broadwell cores, the throughput is up to 2.2 billion unknowns per second for the 3D Laplacian and up to 4 billion unknowns per second for the 3D advection on affine geometries, close to a simple copy operation at 4.7 billion unknowns per second. Our experiments show that MPI ghost exchange has a considerable impact on performance and we present strategies to mitigate this effect. Finally, various options for evaluating geometry terms and their performance are discussed. Our implementations are publicly available through the deal.II finite element library.
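The two kernel-level techniques named in the abstract, sum factorization and the even-odd decomposition of the one-dimensional interpolations, can be illustrated with a minimal sketch. This is not the deal.II implementation: it is plain, unvectorized Python in 2D, and all function names, the equidistant nodes, and the midpoint evaluation points are assumptions chosen for illustration. Sum factorization replaces one large tensor-product interpolation by a sequence of 1D products, one per direction. The even-odd variant exploits the symmetry `S[q][i] == S[m-1-q][n-1-i]`, which holds for Lagrange bases on symmetric point sets, to roughly halve the multiplications in each 1D product.

```python
def equidistant_nodes(n):
    # Illustrative node set on [-1, 1], symmetric about 0 (an assumption).
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

def midpoints(m):
    # Illustrative evaluation points, symmetric about 0 (stand-in for quadrature points).
    return [-1.0 + (2.0 * q + 1.0) / m for q in range(m)]

def lagrange_matrix(nodes, points):
    # S[q][i] = value of the i-th Lagrange polynomial on `nodes` at points[q].
    S = []
    for x in points:
        row = []
        for i, xi in enumerate(nodes):
            val = 1.0
            for k, xk in enumerate(nodes):
                if k != i:
                    val *= (x - xk) / (xi - xk)
            row.append(val)
        S.append(row)
    return S

def matvec(S, x):
    # Plain 1D kernel: y = S x.
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]

def matvec_even_odd(S, x):
    # Same result as matvec(S, x), assuming S[q][i] == S[m-1-q][n-1-i].
    m, n = len(S), len(x)
    half = n // 2
    xe = [x[i] + x[n - 1 - i] for i in range(half)]  # even part of the input
    xo = [x[i] - x[n - 1 - i] for i in range(half)]  # odd part of the input
    y = [0.0] * m
    for q in range((m + 1) // 2):
        # A real kernel would precompute the even and odd matrices once.
        a = sum(0.5 * (S[q][i] + S[q][n - 1 - i]) * xe[i] for i in range(half))
        b = sum(0.5 * (S[q][i] - S[q][n - 1 - i]) * xo[i] for i in range(half))
        if n % 2 == 1:
            a += S[q][half] * x[half]  # middle column contributes to the even part
        y[q] = a + b
        y[m - 1 - q] = a - b
    return y

def interpolate_naive(S, u):
    # v[q1][q2] = sum_{i,j} S[q1][i] * S[q2][j] * u[i][j], evaluated directly.
    m, n = len(S), len(u)
    return [[sum(S[q1][i] * S[q2][j] * u[i][j]
                 for i in range(n) for j in range(n))
             for q2 in range(m)] for q1 in range(m)]

def interpolate_sum_factorized(S, u, kernel=matvec):
    # Same values, computed as two sweeps of 1D products.
    m, n = len(S), len(u)
    # Sweep 1: apply S along the first index; cols[j][q1] = sum_i S[q1][i] * u[i][j].
    cols = [kernel(S, [u[i][j] for i in range(n)]) for j in range(n)]
    # Sweep 2: apply S along the second index.
    return [kernel(S, [cols[j][q1] for j in range(n)]) for q1 in range(m)]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

As a usage example, building `S = lagrange_matrix(equidistant_nodes(4), midpoints(5))` and passing the same 4×4 coefficient array `u` to `interpolate_naive(S, u)`, to `interpolate_sum_factorized(S, u)`, and to `interpolate_sum_factorized(S, u, kernel=matvec_even_odd)` should give results that agree up to rounding. In d dimensions the direct evaluation grows like the product of all point and node counts, whereas the factorized form needs only d sweeps of 1D products, which is what makes high polynomial degrees affordable.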
