Vectorized OpenCL implementation of numerical integration for higher order finite elements

In this work we analyze computational aspects of numerical integration in finite element calculations and consider an OpenCL implementation of the related algorithms for processors with wide vector registers. As a test platform we choose the PowerXCell processor, an example of the Cell Broadband Engine (CellBE) architecture. Although the processor is old by today's standards (its design dates back to 2001), we investigate its performance because of two features that it shares with the recent Xeon Phi family of coprocessors: wide vector units and a relatively slow connection between the computing cores and main global memory. The analysis of parallelization options can also be used to design numerical integration algorithms for other processors with vector registers, such as contemporary x86 microprocessors. We consider higher order finite element approximations and implement the standard algorithm of numerical integration for prismatic elements. The original contributions of the paper include an analysis of the data movement and vector operations performed during code execution. Several versions of the implementation are developed and tested in practice.
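To make the loop structure of the standard numerical integration algorithm concrete, the following is a minimal scalar sketch in C. It is not the paper's OpenCL code: it assumes a 1D element with two linear shape functions and a 2-point Gauss-Legendre rule instead of a prismatic element, and the names `shape_functions` and `integrate_mass_matrix` are hypothetical. The nesting (quadrature points outside, pairs of shape functions inside) is the part that carries over; in a vectorized version the innermost loop is a natural candidate for mapping onto vector registers.

```c
#include <assert.h>

#define NUM_SHAPE 2 /* assumption: two linear shape functions on a 1D element */
#define NUM_GAUSS 2 /* assumption: 2-point Gauss-Legendre rule on [0, 1] */

/* Hypothetical helper: values of the linear shape functions at xi in [0, 1]. */
static void shape_functions(double xi, double phi[NUM_SHAPE])
{
    phi[0] = 1.0 - xi;
    phi[1] = xi;
}

/* Integrate the element mass matrix M[i][j] = integral of phi_i * phi_j
 * over an element of length h, using numerical quadrature. */
void integrate_mass_matrix(double h, double M[NUM_SHAPE][NUM_SHAPE])
{
    assert(h > 0.0);

    /* Gauss points 0.5 -/+ 1/(2*sqrt(3)) and weights on the reference interval. */
    const double offset = 0.28867513459481287;
    const double points[NUM_GAUSS] = { 0.5 - offset, 0.5 + offset };
    const double weights[NUM_GAUSS] = { 0.5, 0.5 };
    const double det_jacobian = h; /* constant for an affine 1D mapping */

    for (int i = 0; i < NUM_SHAPE; ++i)
        for (int j = 0; j < NUM_SHAPE; ++j)
            M[i][j] = 0.0;

    /* Outer loop over quadrature points. */
    for (int q = 0; q < NUM_GAUSS; ++q) {
        double phi[NUM_SHAPE];
        shape_functions(points[q], phi);
        const double vol = weights[q] * det_jacobian;

        /* Inner loops over pairs of shape functions; the j loop is the
         * part a vectorized implementation could process in wide registers. */
        for (int i = 0; i < NUM_SHAPE; ++i)
            for (int j = 0; j < NUM_SHAPE; ++j)
                M[i][j] += vol * phi[i] * phi[j];
    }
}
```

For an element of length h this rule integrates the products of linear shape functions exactly, so the result should match the analytic mass matrix (h/6) [[2, 1], [1, 2]] up to rounding.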
