Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly

The numerical solution of partial differential equations using the finite element method is one of the key applications of high performance computing. Local assembly is its characteristic operation: a problem-specific kernel is executed to numerically evaluate an integral for each element in the discretized problem domain. Since the domain can be very large, efficient kernels are essential. Optimizing them is, however, challenging. Although the kernels generally consist of affine loop nests, the short trip counts and the complexity of the mathematical expressions make it hard to identify a single sequence of transformations that is effective across problems. We therefore present the design and systematic evaluation of COFFEE, a domain-specific compiler for local assembly kernels. COFFEE manipulates abstract syntax trees generated from a high-level domain-specific language for PDEs, applying domain-aware, composable optimizations aimed at improving instruction-level parallelism, especially SIMD vectorization, and register locality. It then generates C code including vector intrinsics. Experiments using a range of finite element forms of increasing complexity show that significant performance improvements are achieved.
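To make the shape of a local assembly kernel concrete, the following is a minimal sketch in C and is not code produced by COFFEE. It assumes a mass matrix on a linear (P1) triangle integrated with a 3-point edge-midpoint quadrature rule; the names `mass_baseline`, `mass_hoisted`, `W`, `PHI`, `NQ`, and `NB` are hypothetical. The second function hoists a loop-invariant subexpression out of the innermost loop by hand, as one example of the kind of rewrite an expression-aware compiler could apply to such short loop nests.

```c
/* Illustrative sketch only: hypothetical kernel, not COFFEE output. */
#define NQ 3 /* quadrature points (assumed rule) */
#define NB 3 /* basis functions of a P1 triangle (assumed element) */

/* Weights of the 3-point edge-midpoint rule on the reference triangle. */
static const double W[NQ] = {1.0 / 6.0, 1.0 / 6.0, 1.0 / 6.0};

/* PHI[q][j]: P1 basis function j evaluated at quadrature point q. */
static const double PHI[NQ][NB] = {
    {0.5, 0.5, 0.0},
    {0.0, 0.5, 0.5},
    {0.5, 0.0, 0.5},
};

/* |det J| of the affine map; xy holds x0,y0,x1,y1,x2,y2. */
static double abs_det_jacobian(const double *xy)
{
    const double d = (xy[2] - xy[0]) * (xy[5] - xy[1])
                   - (xy[4] - xy[0]) * (xy[3] - xy[1]);
    return d < 0.0 ? -d : d;
}

/* Baseline: the full product is re-evaluated for every (q, j, k). */
void mass_baseline(double A[NB][NB], const double *xy)
{
    const double detJ = abs_det_jacobian(xy);
    for (int j = 0; j < NB; j++)
        for (int k = 0; k < NB; k++)
            A[j][k] = 0.0;
    for (int q = 0; q < NQ; q++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                A[j][k] += W[q] * detJ * PHI[q][j] * PHI[q][k];
}

/* Same kernel with the k-invariant factor hoisted out of the inner loop. */
void mass_hoisted(double A[NB][NB], const double *xy)
{
    const double detJ = abs_det_jacobian(xy);
    for (int j = 0; j < NB; j++)
        for (int k = 0; k < NB; k++)
            A[j][k] = 0.0;
    for (int q = 0; q < NQ; q++)
        for (int j = 0; j < NB; j++) {
            const double t = W[q] * detJ * PHI[q][j]; /* invariant in k */
            for (int k = 0; k < NB; k++)
                A[j][k] += t * PHI[q][k];
        }
}
```

Both functions take the six vertex coordinates of one element and fill its 3x3 element matrix; in a full solver they would be called once per mesh element. With trip counts of only three per loop, the inner loop is shorter than a typical SIMD register is wide, which illustrates why generic loop optimizations alone may struggle on kernels of this kind.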
