Parallel performance modeling of irregular applications in cell-centered finite volume methods over unstructured tetrahedral meshes

Finite volume methods are widely used numerical strategies for solving partial differential equations. This paper aims to obtain a quantitative understanding of the achievable performance of the cell-centered finite volume method on 3D unstructured tetrahedral meshes, using traditional multicore CPUs as well as modern GPUs. By using an optimized implementation and a synthetic connectivity matrix that exhibits a perfect structure of equal-sized blocks lying on the main diagonal, we can closely relate the achievable computing performance to the size of these diagonal blocks. Moreover, we have derived a theoretical model that identifies characteristic levels of the attainable performance as a function of hardware parameters, from which a realistic upper limit of the performance can be predicted accurately. For real-world tetrahedral meshes, the key to high performance lies in reordering the tetrahedra so that the resulting connectivity matrix resembles a block diagonal form, where the optimal block size depends on the hardware. Numerical experiments confirm that the achieved performance is close to the practically attainable maximum and reaches 75% of the theoretical upper limit, independent of the actual tetrahedral mesh considered. From this, we develop a general model capable of identifying the bottleneck performance of a system's memory hierarchy in irregular applications.

Multicore and GPU code optimization for finite volume computation.

Numerical experiments investigating performance relative to irregularity.

Detailed performance modeling based on CPU and GPU architecture.

Generalized performance model for identifying bottlenecks in irregular applications.
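To make the block diagonal idea concrete, the following minimal Python sketch measures how many neighbor references of a cell connectivity list stay inside equal-sized diagonal blocks, before and after a renumbering of the cells. It is an illustration only, not the paper's implementation: the function names `in_block_fraction` and `renumber`, the block size, and the toy chain of eight cells (standing in for a tetrahedral mesh, where each cell has up to four face neighbors) are all assumptions made here.

```python
def in_block_fraction(neighbors, block_size):
    """Fraction of neighbor references that fall in the same diagonal block.

    neighbors[i] lists the cells adjacent to cell i, i.e. the column
    indices of row i in the connectivity matrix (hypothetical layout).
    """
    total = 0
    inside = 0
    for cell, nbrs in enumerate(neighbors):
        for nbr in nbrs:
            total += 1
            if cell // block_size == nbr // block_size:
                inside += 1
    return inside / total if total else 1.0


def renumber(neighbors, order):
    """Renumber cells; order[k] is the old index of the cell placed at position k."""
    new_index = {old: new for new, old in enumerate(order)}
    return [[new_index[n] for n in neighbors[old]] for old in order]


# Toy connectivity: a chain of 8 cells, each adjacent to its predecessor
# and successor (an assumption chosen only to keep the example small).
chain = [[j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)]

# An interleaved numbering scatters adjacent cells across the two blocks.
scrambled = renumber(chain, [0, 4, 1, 5, 2, 6, 3, 7])

print(in_block_fraction(chain, 4))      # 12 of 14 references stay in-block
print(in_block_fraction(scrambled, 4))  # 8 of 14 references stay in-block
```

A higher in-block fraction means more neighbor data is reused from a compact index range, which is the property the reordering aims for. How that translates into achieved performance depends on the hardware and is not modeled by this sketch.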
