Combining Performance Aspects of Irregular Gauss-Seidel Via Sparse Tiling

Finite element problems are often solved using multigrid techniques. The most time-consuming part of multigrid is the iterative smoother, such as Gauss-Seidel. To improve performance, iterative smoothers can exploit parallelism, intra-iteration data reuse, and inter-iteration data reuse. Current methods for parallelizing Gauss-Seidel on irregular grids, such as multi-coloring and owner-computes-based techniques, exploit parallelism and possibly intra-iteration data reuse, but not inter-iteration data reuse. Sparse tiling techniques were developed to improve intra-iteration and inter-iteration data locality in iterative smoothers. This paper describes how sparse tiling can additionally provide parallelism. Our results show the effectiveness of Gauss-Seidel parallelized with sparse tiling techniques on shared memory machines, specifically compared to owner-computes-based Gauss-Seidel methods, which employ only parallelism and intra-iteration locality. Our results support the premise that better performance occurs when all three performance aspects (parallelism, intra-iteration data locality, and inter-iteration data locality) are combined.
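For orientation, the following is a minimal sketch of a plain serial Gauss-Seidel smoother on a sparse matrix, which is the baseline computation that sparse tiling reorders. It is not the paper's implementation: the function name `gauss_seidel_sweeps`, the compressed sparse row (CSR) layout, and the small test system are assumptions chosen for illustration. The comments mark where the two kinds of data reuse named above arise.

```python
def gauss_seidel_sweeps(row_ptr, col_idx, vals, b, x, num_sweeps):
    """Run num_sweeps in-place Gauss-Seidel sweeps on A x = b (A in CSR form).

    Hypothetical helper for illustration; assumes every row stores a
    nonzero diagonal entry.
    """
    n = len(b)
    for _ in range(num_sweeps):
        # Inter-iteration reuse: every sweep touches the same x, b, and
        # matrix rows again, so data left in cache by one sweep could be
        # reused by the next if the iterations were scheduled close together.
        for i in range(n):
            diag = 0.0
            acc = b[i]
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    # Intra-iteration reuse: neighboring unknowns x[j] are
                    # read by several rows within the same sweep. Values
                    # with j < i were already updated in this sweep.
                    acc -= vals[k] * x[j]
            x[i] = acc / diag
    return x


# Assumed example: the 3x3 system [[4,-1,0],[-1,4,-1],[0,-1,4]] x = [3,2,3],
# whose exact solution is [1, 1, 1].
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
vals = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
b = [3.0, 2.0, 3.0]
x = gauss_seidel_sweeps(row_ptr, col_idx, vals, b, [0.0, 0.0, 0.0], 50)
```

The dependence of each update on values produced earlier in the same sweep is what makes the smoother hard to parallelize on irregular grids, and it is the ordering constraint that any tiling or coloring scheme has to respect.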
