Sparse Tiling for Stationary Iterative Methods

In modern computers, a program’s data locality can affect performance significantly. This paper details full sparse tiling, a run-time reordering transformation that improves the data locality for stationary iterative methods such as Gauss–Seidel operating on sparse matrices. In scientific applications such as finite element analysis, these iterative methods dominate the execution time. Full sparse tiling chooses a permutation of the rows and columns of the sparse matrix, and then an order of execution that achieves better data locality. We prove that full sparsetiled Gauss–Seidel generates a solution that is bitwise identical to traditional Gauss–Seidel on the permuted matrix. We also present measurements of the performance improvements and the overheads of full sparse tiling and of cache blocking for irregular grids, a related technique developed by Douglas et al.

[1]  C. A. R. HOARE,et al.  An axiomatic basis for computer programming , 1969, CACM.

[2]  Dennis Gannon,et al.  Strategies for cache and local memory management by global program transformation , 1988, J. Parallel Distributed Comput..

[3]  Eun Im,et al.  Optimizing the Performance of Sparse Matrix-Vector Multiplication , 2000 .

[4]  George Karypis,et al.  Multilevel k-way Partitioning Scheme for Irregular Graphs , 1998, J. Parallel Distributed Comput..

[5]  William Pugh,et al.  A unifying framework for iteration reordering transformations , 1995, Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing.

[6]  Chau-Wen Tseng,et al.  Tiling Optimizations for 3D Scientific Computations , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[7]  Larry Carter,et al.  Rescheduling for Locality in Sparse Matrix Computations , 2001, International Conference on Computational Science.

[8]  Zhiyuan Li,et al.  New tiling techniques to improve cache temporal locality , 1999, PLDI '99.

[9]  Keshav Pingali,et al.  Synthesizing transformations for locality enhancement of imperfectly-nested loop nests , 2000 .

[10]  Michael Wolfe,et al.  Iteration Space Tiling for Memory Hierarchies , 1987, PPSC.

[11]  Kang Su Gatlin,et al.  Architecture-Cognizant Divide and Conquer Algorithms , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[12]  David S. Johnson,et al.  Some Simplified NP-Complete Graph Problems , 1976, Theor. Comput. Sci..

[13]  David G. Wonnacott,et al.  Achieving Scalable Locality with Time Skewing , 2002, International Journal of Parallel Programming.

[14]  Joel H. Saltz,et al.  Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures , 1994, J. Parallel Distributed Comput..

[15]  Ulrich Rüde,et al.  Cache Optimization for Structured and Unstructured Grid Multigrid , 2000 .

[16]  Joel H. Saltz,et al.  Principles of runtime support for parallel processors , 1988, ICS '88.

[17]  Ken Kennedy,et al.  Improving cache performance in dynamic applications through data and computation reorganization at run time , 1999, PLDI '99.

[18]  William Pugh,et al.  Finding Legal Reordering Transformations Using Mappings , 1994, LCPC.

[19]  Larry Carter,et al.  Combining Performance Aspects of Irregular Gauss-Seidel Via Sparse Tiling , 2002, LCPC.

[20]  C. H. Wu A multicolour SOR method for the finite-element method , 1990 .

[21]  Ken Kennedy,et al.  Compiler blockability of numerical algorithms , 1992, Proceedings Supercomputing '92.

[22]  William Pugh,et al.  Nonlinear array dependence analysis , 1994 .

[23]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[24]  Mark F. Adams A distributed memory unstructured gauss-seidel algorithm for multigrid smoothers , 2001, SC.

[25]  Michael Wolfe,et al.  More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[26]  Larry Carter,et al.  Performance transformations for irregular applications , 2003 .

[27]  Robert J. Fowler,et al.  Increasing Temporal Locality with Skewing and Recursive Blocking , 2001, International Conference on Software Composition.

[28]  William Pugh,et al.  SIPR: A New Framework for Generating Efficient Code for Sparse Matrix Computations , 1998, LCPC.

[29]  Siddhartha Chatterjee,et al.  Cache-Efficient Multigrid Algorithms , 2004, Int. J. High Perform. Comput. Appl..

[30]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[31]  Kei Davis,et al.  Optimizing Transformations of Stencil Operations for Parallel Object-Oriented Scientific Frameworks on Cache-Based Architectures , 1998, ISCOPE.

[32]  Chau-Wen Tseng,et al.  Efficient compiler and run-time support for parallel irregular reductions , 2000, Parallel Comput..

[33]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[34]  Larry Carter,et al.  Localizing non-affine array references , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[35]  William Pugh,et al.  Iteration Space Slicing for Locality , 1999, LCPC.

[36]  Jack J. Dongarra,et al.  End-user Tools for Application Performance Analysis Using Hardware Counters , 2001, ISCA PDCS.

[37]  Richard Barrett,et al.  Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods , 1994, Other Titles in Applied Mathematics.

[38]  C. A. R. Hoare,et al.  An axiomatic basis for computer programming , 1969, CACM.

[39]  Katherine A. Yelick,et al.  Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY , 2001, International Conference on Computational Science.

[40]  Ken Kennedy,et al.  Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings , 2001, International Journal of Parallel Programming.

[41]  Keshav Pingali,et al.  Data-centric multi-level blocking , 1997, PLDI '97.