Run-Time Optimization of Sparse Matrix-Vector Multiplication on SIMD Machines

Sparse matrix-vector multiplication forms the heart of the iterative linear solvers widely used in scientific computing (e.g., in finite element methods). In such solvers, the matrix-vector product is computed repeatedly, often thousands of times, with updated values of the vector, until convergence is achieved. On a SIMD architecture, each processor has to fetch the updated off-processor vector elements while computing its share of the product. In this paper, we report on run-time optimization of array distribution and off-processor data fetching to reduce both communication and computation time. The optimization is applied to a sparse matrix stored in compressed sparse row (CSR) format. Actual runs on test matrices produced up to a 35 percent improvement relative to a block distribution with a naive multiplication algorithm, while simulations over a wider range of processors indicate that up to a 60 percent improvement may be possible in some cases.
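To make the baseline concrete, the following is a minimal serial sketch of a CSR matrix-vector product, together with a helper that lists which vector elements each processor would need to fetch under a block row distribution. It is an illustration only, not the optimized scheme reported in this paper. The names `csr_matvec`, `off_processor_columns`, `row_ptr`, `col_idx`, and `values` are assumptions introduced for the example, as is the choice to distribute the vector in the same blocks as the matrix rows.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Naive y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i;
    col_idx and values hold their column indices and values.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def off_processor_columns(row_ptr, col_idx, n_procs):
    """For each processor, list the vector indices it does not own.

    Assumption (hypothetical layout): rows and vector elements are
    split into contiguous blocks of ceil(n / n_procs) entries, and the
    matrix is square so the same blocks apply to rows and columns.
    """
    n = len(row_ptr) - 1
    block = -(-n // n_procs)  # ceiling division
    needed = []
    for p in range(n_procs):
        lo, hi = p * block, min((p + 1) * block, n)
        cols = set()
        for i in range(lo, hi):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if not (lo <= j < hi):  # element lives on another processor
                    cols.add(j)
        needed.append(sorted(cols))
    return needed


# Illustrative 4x4 matrix (made up for this example):
# [[4, 0, 0, 1],
#  [0, 3, 0, 0],
#  [0, 2, 5, 0],
#  [1, 0, 0, 6]]
row_ptr = [0, 2, 3, 5, 7]
col_idx = [0, 3, 1, 1, 2, 0, 3]
values = [4, 1, 3, 2, 5, 1, 6]
x = [1, 2, 3, 4]

y = csr_matvec(row_ptr, col_idx, values, x)
fetch_lists = off_processor_columns(row_ptr, col_idx, n_procs=2)
```

In this sketch, the length of each entry in `fetch_lists` is a rough proxy for the per-iteration communication a processor incurs. A different array distribution changes those lists, and that is the quantity a run-time optimization of the distribution would aim to reduce.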
