Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modeling

Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]] (hence the name “sparse”). One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation: not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state of the art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups of up to 1.28× on a platform consisting of 896 Intel Broadwell cores.
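To make the idea of fusing loops with indirect accesses concrete, the following is a minimal sketch in C; it is not the algorithm implemented in SLOPE. It assumes a toy mesh in which every edge touches two vertices through a hypothetical map e2v, and all names (run_unfused, run_tiled, e2v, ew, vert, out) are illustrative. The first loop accumulates edge weights into vertices, and the second reads the vertices back. An inspector gives each second-loop iteration the highest tile number among the first-loop iterations that write its vertices; an executor then runs both loops tile by tile.

```c
#include <assert.h>
#include <stdlib.h>

/* Reference version: two separate sweeps over all edges.
   e2v holds two vertex indices per edge (hypothetical layout). */
void run_unfused(int n_edges, const int *e2v, const int *ew,
                 int *vert, int *out)
{
    for (int e = 0; e < n_edges; e++) {       /* loop 1: edges -> vertices */
        vert[e2v[2 * e]]     += ew[e];
        vert[e2v[2 * e + 1]] += ew[e];
    }
    for (int e = 0; e < n_edges; e++)         /* loop 2: vertices -> edges */
        out[e] = vert[e2v[2 * e]] + vert[e2v[2 * e + 1]];
}

/* Sparse-tiled version (sketch). Returns 0 on success, -1 on allocation
   failure. Loop 1 is split into contiguous blocks of tile_size edges. */
int run_tiled(int n_edges, int n_verts, int tile_size, const int *e2v,
              const int *ew, int *vert, int *out)
{
    assert(tile_size > 0);
    if (n_edges <= 0)
        return 0;

    int n_tiles = (n_edges + tile_size - 1) / tile_size;
    int *vtile = malloc((size_t)n_verts * sizeof *vtile);
    int *tile2 = malloc((size_t)n_edges * sizeof *tile2);
    if (!vtile || !tile2) {
        free(vtile);
        free(tile2);
        return -1;
    }

    /* Inspector, step 1: for each vertex, the last tile that writes it. */
    for (int v = 0; v < n_verts; v++)
        vtile[v] = 0;
    for (int e = 0; e < n_edges; e++) {
        int t = e / tile_size;
        for (int k = 0; k < 2; k++)
            if (vtile[e2v[2 * e + k]] < t)
                vtile[e2v[2 * e + k]] = t;
    }
    /* Inspector, step 2: a loop-2 iteration may run only once every
       vertex it reads is final, i.e. in the latest tile among them. */
    for (int e = 0; e < n_edges; e++) {
        int a = vtile[e2v[2 * e]], b = vtile[e2v[2 * e + 1]];
        tile2[e] = a > b ? a : b;
    }

    /* Executor: per tile, run its loop-1 slice, then its loop-2 slice.
       The full scan for loop 2 keeps the sketch short; a practical
       implementation would precompute per-tile iteration lists. */
    for (int t = 0; t < n_tiles; t++) {
        int lo = t * tile_size;
        int hi = lo + tile_size < n_edges ? lo + tile_size : n_edges;
        for (int e = lo; e < hi; e++) {
            vert[e2v[2 * e]]     += ew[e];
            vert[e2v[2 * e + 1]] += ew[e];
        }
        for (int e = 0; e < n_edges; e++)
            if (tile2[e] == t)
                out[e] = vert[e2v[2 * e]] + vert[e2v[2 * e + 1]];
    }

    free(vtile);
    free(tile2);
    return 0;
}
```

Called with the same inputs and zero-initialized vert and out arrays, run_tiled should produce the same results as run_unfused, because each second-loop iteration is deferred until all writes to the vertices it reads have completed. Locality can improve because a tile's vertex data is reused while it is still likely to be in cache.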
