Modeling the Performance of Geometric Multigrid Stencils on Multicore Computer Architectures

The basic building blocks of the classic geometric multigrid algorithm all have a low ratio of executed floating point operations per byte fetched from memory. On modern computer architectures, such computational kernels are typically bound by memory traffic and achieve only a small percentage of the theoretical peak floating point performance of the underlying hardware. We suggest the use of state-of-the-art (stencil) compiler techniques to improve the flop per byte ratio, also called the arithmetic intensity, of the steps in the algorithm. Our focus will be on the smoother which is a repeated stencil application. With a tiling approach based on the polyhedral loop optimization framework, data reuse in the smoother can be improved, leading to a higher effective arithmetic intensity. For an academic constant coefficient Poisson problem, we present a performance model for the multigrid $V$-cycle solver based on the tiled smoother. For increasing numbers of smoothing steps, there is a trade-off between the ...

[1]  Samuel Williams,et al.  Optimization of geometric multigrid for emerging multi- and manycore processors , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[2]  James Demmel,et al.  Avoiding communication in sparse matrix computations , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[3]  Pen-Chung Yew,et al.  Tile size selection revisited , 2013, ACM Trans. Archit. Code Optim..

[4]  Gerhard Wellein,et al.  Efficient multicore-aware parallelization strategies for iterative stencil computations , 2010, J. Comput. Sci..

[5]  Samuel Williams,et al.  Roofline: an insightful visual performance model for multicore architectures , 2009, CACM.

[6]  Anne Greenbaum,et al.  Iterative methods for solving linear systems , 1997, Frontiers in applied mathematics.

[7]  James Demmel,et al.  Minimizing communication in sparse matrix solvers , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[8]  Uday Bondhugula,et al.  Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model , 2008, CC.

[9]  Richard Barrett,et al.  Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods , 1994, Other Titles in Applied Mathematics.

[10]  William L. Briggs,et al.  A multigrid tutorial , 1987 .

[11]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[12]  Ulrich Rüde,et al.  Fixed and Adaptive Cache Aware Algorithms for Multigrid Methods , 2000 .

[13]  Gerhard Wellein,et al.  Introduction to High Performance Computing for Scientists and Engineers , 2010, Chapman and Hall / CRC computational science series.

[14]  Howard C. Elman,et al.  Finite Elements and Fast Iterative Solvers: with Applications in Incompressible Fluid Dynamics , 2014 .

[15]  Jack Dongarra,et al.  Scheduling dense linear algebra operations on multicore processors , 2010 .

[16]  Samuel Williams,et al.  Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors , 2007, SIAM Rev..

[17]  Craig C. Douglas,et al.  Caching in with Multigrid Algorithms: Problems in Two Dimensions , 1996, Parallel Algorithms Appl..

[18]  Jan Treibig,et al.  Efficiency improvements of iterative numerical algorithms on modern architectures , 2008 .

[19]  Gerhard Wellein,et al.  Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization , 2009, 2009 33rd Annual IEEE International Computer Software and Applications Conference.

[20]  Christian Lengauer,et al.  Loop Parallelization in the Polytope Model , 1993, CONCUR.

[21]  Larry Carter,et al.  Sparse Tiling for Stationary Iterative Methods , 2004, Int. J. High Perform. Comput. Appl..

[22]  Andrew Selle,et al.  Efficient elasticity for character skinning with contact and collisions , 2011, SIGGRAPH 2011.

[23]  Uday Bondhugula,et al.  A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.

[24]  Helmar Burkhart,et al.  PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[25]  Larry Carter,et al.  An approach for code generation in the Sparse Polyhedral Framework , 2016, Parallel Comput..

[26]  Yolande Berbers,et al.  Efficient Synchronization for Stencil Computations Using Dynamic Task Graphs , 2013, ICCS.

[27]  Bradley C. Kuszmaul,et al.  The pochoir stencil compiler , 2011, SPAA '11.

[28]  Mark Hoemmen,et al.  Communication-avoiding Krylov subspace methods , 2010 .

[29]  Wim Vanroose,et al.  Improving the arithmetic intensity of multigrid with the help of polynomial smoothers , 2012, Numer. Linear Algebra Appl..

[30]  Ulrich Rüde,et al.  Cache-Aware Multigrid Methods for Solving Poisson's Equation in Two Dimensions , 2000, Computing.

[31]  Ulrich Rüde,et al.  Cache Optimization for Structured and Unstructured Grid Multigrid , 2000 .

[32]  Yousef Saad,et al.  Iterative methods for sparse linear systems , 2003 .

[33]  David A. Patterson,et al.  Computer Organization and Design, Fourth Edition, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design) , 2008 .

[34]  Francky Catthoor,et al.  Polyhedral parallel code generation for CUDA , 2013, TACO.

[35]  Uday Bondhugula,et al.  Tiling stencil computations to maximize parallelism , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.