Automatic Performance Tuning and Analysis of Sparse Triangular Solve

We address the problem of building high-performance uniprocessor implementations of sparse triangular solve (SpTS) automatically. This computational kernel is often the bottleneck in a variety of scientific and engineering applications that require the direct solution of sparse linear systems. Performance tuning of SpTS, and of sparse matrix kernels in general, is a tedious and time-consuming task, because performance depends on the complex interaction of many factors: the performance gap between processors and memory, the limits on the scope of compiler analyses and transformations, and the overhead of manipulating sparse data structures. Consequently, it is not unusual to see kernels such as SpTS run at under 10% of peak uniprocessor floating-point performance.

Our approach to the automatic tuning of SpTS builds on prior experience with building tuning systems for sparse matrix-vector multiply (SpM×V) [21, 22, 40] and for dense matrix kernels [8, 41]. In particular, we adopt the two-step methodology of previous approaches: (1) identify and generate a set of reasonable candidate implementations, and (2) search this set for the fastest implementation by some combination of performance modeling and actually executing the implementations.

In this paper, we consider the solution of the sparse lower triangular system Lx = y for a single dense vector x, given the lower triangular sparse matrix L and the dense vector y. We refer to x as the solution vector and to y as the right-hand side (RHS). Many of the lower triangular factors we have observed from sparse LU factorization have a large, dense triangle in the lower right-hand corner of the matrix; this trailing triangle can account for as much as 90% of the matrix non-zeros. We therefore consider both algorithmic and data structure reorganizations that partition the solve into a sparse phase and a dense phase. In the sparse phase, we adapt the register blocking optimization, previously proposed for SpM×V in the Sparsity system [21, 22], to the SpTS kernel; in the dense phase, we make judicious use of highly tuned BLAS routines by switching to a dense implementation (the switch-to-dense optimization). We describe fully automatic hybrid off-line/on-line heuristics for selecting the key tuning parameters: the register block size and the point at which to switch to the dense algorithm. (See Section 2.)

We then evaluate the performance of our optimized implementations relative to the fundamental limits on performance. Specifically, we first derive simple models of the upper bounds on the execution rate (Mflop/s) of our implementations. Using hardware counter data collected with the PAPI library [10], we verify these models on three hardware platforms (Table 1) and a set of triangular factors from applications (Table 2). We observe that our optimized implementations can achieve 80% or more of these bounds; furthermore, we observe speedups of up to 1.8× when both the register blocking and switch-to-dense optimizations are applied. We also present preliminary results confirming that our heuristics choose reasonable values for the tuning parameters. These results support our prior findings with SpM×V [40], and suggest two new directions for performance enhancements: (1) the use of higher-level matrix structures (e.g., matrix reordering and multiple register block sizes), and (2) the optimization of kernels with more opportunities for data reuse (e.g., multiplication and solve with multiple vectors, multiplication of AᵀA by a vector).
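To make the two-phase algorithm concrete, the sketch below shows a plain C version of the hybrid forward substitution: a CSR sparse phase over the leading triangle, a sparse matrix-vector update coupling the trailing rows to the already-computed entries, and a dense phase that hands the trailing triangle to a tuned BLAS triangular solve. This is a minimal illustration under assumed data layouts, not the paper's implementation: the partition L = [ L11 0 ; L21 L22 ], the CSR arrays, and the crossover point d are all illustrative names, and a CBLAS implementation (e.g., ATLAS or OpenBLAS) is assumed to be available.

#include <cblas.h>

/* Minimal sketch of the hybrid sparse/dense solve (an illustration under
 * assumed data structures, not the authors' code). L is partitioned as
 * [ L11 0 ; L21 L22 ]:
 *   L11: (n-d) x (n-d) sparse lower triangle, CSR, diagonal last in each row
 *   L21: d x (n-d) sparse rectangular block, CSR
 *   L22: d x d dense lower triangle, row-major                           */
void sptrsv_switch_to_dense(int n, int d,
                            const double *v11, const int *j11, const int *p11,
                            const double *v21, const int *j21, const int *p21,
                            const double *L22,
                            const double *y, double *x)
{
    int ns = n - d;

    /* Sparse phase: forward substitution through L11. */
    for (int i = 0; i < ns; i++) {
        double t = y[i];
        int k = p11[i];
        for (; k < p11[i + 1] - 1; k++)   /* off-diagonal entries of row i */
            t -= v11[k] * x[j11[k]];
        x[i] = t / v11[k];                /* diagonal is stored last */
    }

    /* Coupling: x2 <- y2 - L21 * x1 (a sparse matrix-vector multiply). */
    for (int i = 0; i < d; i++) {
        double t = y[ns + i];
        for (int k = p21[i]; k < p21[i + 1]; k++)
            t -= v21[k] * x[j21[k]];
        x[ns + i] = t;
    }

    /* Dense phase (switch-to-dense): tuned BLAS solve on the trailing
     * triangle, operating in place on the last d entries of x.          */
    cblas_dtrsv(CblasRowMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                d, L22, d, x + ns, 1);
}

In a tuned version along the lines described above, register blocking would replace the scalar inner loop over L11 with small fixed-size r×c blocks unrolled at code-generation time, and the heuristics of Section 2 would choose r, c, and the crossover point d.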

[1] Stefan Andersson et al. RS/6000 Scientific and Technical Computing: POWER3 Introduction and Tuning Guide, 1998.

[2] Edward Rothberg et al. Alternatives for Solving Sparse Triangular Systems on Distributed-Memory Multiprocessors, 1995, Parallel Comput.

[3] Fred G. Gustavson et al. LAWRA: Linear Algebra with Recursive Algorithms, 2000, PARA.

[4] Jack J. Dongarra et al. Automatically Tuned Linear Algebra Software, 1998, Proceedings of the IEEE/ACM SC98 Conference.

[5] William Kahan et al. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, 2001.

[6] Patrick Amestoy et al. A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling, 2001, SIAM J. Matrix Anal. Appl.

[7] Yuefan Deng et al. New trends in high performance computing, 2001, Parallel Computing.

[8] Todd L. Veldhuizen et al. Arrays in Blitz++, 1998, ISCOPE.

[9] Dragan Mirkovic et al. An adaptive software library for fast Fourier transforms, 2000, ICS '00.

[10] Michael T. Heath et al. Performance of Parallel Sparse Triangular Solution, 1999.

[11] Sathish S. Vadhiyar et al. Towards an Accurate Model for Collective Communications, 2001, Int. J. High Perform. Comput. Appl.

[12] Aart J. C. Bik et al. Automatic Nonzero Structure Analysis, 1999, SIAM J. Comput.

[13] Rafael Hector Saavedra-Barrera. CPU performance evaluation and execution time prediction using narrow spectrum benchmarking, 1992.

[14] Monica S. Lam et al. A data locality optimizing algorithm, 1991, PLDI '91.

[15] James Demmel et al. Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply, 2002, ACM/IEEE SC 2002 Conference (SC'02).

[16] Richard F. Barrett et al. Matrix Market: a web resource for test matrix collections, 1996, Quality of Numerical Software.

[17] Roldan Pozo et al. NIST sparse BLAS user's guide, 2001.

[18] Eun Im et al. Optimizing the Performance of Sparse Matrix-Vector Multiplication, 2000.

[19] Steven G. Johnson et al. FFTW: an adaptive software architecture for the FFT, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[20] Siddhartha Chatterjee et al. Exact analysis of the cache behavior of nested loops, 2001, PLDI '01.

[21] P. Mannucci et al. Abstract, 2003.

[22] Francisco F. Rivera et al. Modeling and Improving Locality for Irregular Problems: Sparse Matrix-Vector Product on Cache Memories as a Case Study, 1999, HPCN Europe.

[23] Anoop Gupta et al. Parallel ICCG on a hierarchical memory multiprocessor - Addressing the triangular solve bottleneck, 1990, Parallel Comput.

[24] Jack J. Dongarra et al. Automated empirical optimizations of software and the ATLAS project, 2001, Parallel Comput.

[25] Vipin Kumar et al. PSPASES: An Efficient and Scalable Parallel Sparse Direct Solver, 1999, PPSC.

[26] Richard Barrett et al. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 1994, Other Titles in Applied Mathematics.

[27] Padma Raghavan et al. Efficient Parallel Sparse Triangular Solution Using Selective Inversion, 1998, Parallel Process. Lett.

[28] Ken Kennedy et al. Compiler blockability of numerical algorithms, 1992, Proceedings Supercomputing '92.

[29] Paul Vinson Stodghill. A Relational Approach to the Automatic Generation of Sequential Sparse Matrix Codes, 1997.

[30] Cleve Ashcraft et al. SPOOLES: An Object-Oriented Sparse Matrix Library, 1999, PPSC.

[31] Timothy A. Davis et al. An Unsymmetric-Pattern Multifrontal Method for Sparse LU Factorization, 1993.

[32] Michele Colajanni et al. PSBLAS: a library for parallel linear algebra computation on sparse matrices, 2000, TOMS.

[33] Vipin Kumar et al. Parallel Algorithms for Forward Elimination and Backward Substitution in Direct Solution of Sparse Linear Systems, 1995.

[34] Fernando L. Alvarado et al. Optimal Parallel Solution of Sparse Triangular Systems, 1993, SIAM J. Sci. Comput.

[35] Josep-Lluís Larriba-Pey et al. Block algorithms for sparse matrix computations on high performance workstations, 1996, ICS '96.

[36] P. Sadayappan et al. On improving the performance of sparse matrix-vector multiplication, 1997, Proceedings Fourth International Conference on High-Performance Computing.

[37] Jeremy G. Siek et al. A Rational Approach to Portable High Performance: The Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) Library, 1998, ECOOP Workshops.

[38] José M. F. Moura et al. Fast Automatic Generation of DSP Algorithms, 2001, International Conference on Computational Science.

[39] Katherine A. Yelick et al. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY, 2001, International Conference on Computational Science.

[40] Jack J. Dongarra et al. A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters, 2000, ACM/IEEE SC 2000 Conference (SC'00).

[41] A. Gupta et al. Parallel Algorithms for Forward and Back Substitution in Direct Solution of Sparse Linear Systems, 1995, Proceedings of the IEEE/ACM SC95 Conference.

[42] James Demmel et al. A Supernodal Approach to Sparse Partial Pivoting, 1999, SIAM J. Matrix Anal. Appl.

[43] James Demmel et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, 1997, ICS '97.

[44] Robert A. van de Geijn et al. FLAME: Formal Linear Algebra Methods Environment, 2001, TOMS.

[45] Eunice E. Santos. Solving triangular linear systems in parallel using substitution, 1995, Proceedings of the Seventh IEEE Symposium on Parallel and Distributed Processing.

[46] Sharad Malik et al. Cache miss equations: a compiler framework for analyzing and tuning memory behavior, 1999, TOPL.

[47] Chau-Wen Tseng et al. Improving data locality with loop transformations, 1996, TOPL.

[48] Olivier Temam et al. Characterizing the behavior of sparse algorithms on caches, 1992, Proceedings Supercomputing '92.

[49] Thomas F. Coleman et al. A parallel triangular solver for a distributed-memory multiprocessor, 1988.

[50] Vipin Kumar et al. A high performance two dimensional scalable parallel algorithm for solving sparse triangular systems, 1997, Proceedings Fourth International Conference on High-Performance Computing.