Sparse Matrix Vector Processing Formats

This dissertation identifies shortcomings of vector processors in storing and processing sparse matrices efficiently. To alleviate these problems, we propose two storage formats: the Block Based Compression Storage (BBCS) format and the Hierarchical Sparse Matrix (HiSM) storage format. We further propose vector instruction set extensions and microarchitectural mechanisms that speed up frequently used sparse matrix operations on the proposed formats. Because existing benchmarks do not cover both storage formats and sparse matrix operations, we also introduce a benchmark that covers both. To evaluate our proposals, we developed a simulator based on SimpleScalar, extended to incorporate the proposed changes, and established the following. Regarding storage space, the proposed formats require 72% to 78% of the space needed by Compressed Row Storage (CRS) or Jagged Diagonal (JD) storage, both commonly used sparse matrix storage formats. Regarding Sparse Matrix Vector Multiplication (SMVM), both BBCS and HiSM achieve a considerable speedup over CRS and JD. In particular, performing SMVM with the HiSM format and the newly proposed instructions yields speedups of 5.3 and 4.07 over CRS and JD, respectively. Additionally, element insertion using HiSM can be sped up by a factor of 2 to 400, depending on the sparsity of the matrix. Finally, we show that the transposition operation can be sped up by a factor of 17.7 compared to CRS.
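
For readers unfamiliar with the CRS baseline mentioned above, the following is a minimal Python sketch of CRS storage and of SMVM over it. It is an illustration only, not the dissertation's implementation and not a sketch of BBCS or HiSM. The function names `crs_from_dense` and `crs_smvm` and the array names `values`, `col_idx`, and `row_ptr` are assumptions introduced here.

```python
def crs_from_dense(dense):
    """Convert a dense row-major matrix (list of lists) to CRS arrays.

    Hypothetical helper for illustration; the names are not from the source.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)      # non-zero value
                col_idx.append(j)     # its column position
        row_ptr.append(len(values))   # where the next row starts in values
    return values, col_idx, row_ptr


def crs_smvm(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix A stored in CRS form."""
    n_rows = len(row_ptr) - 1
    y = [0] * n_rows
    for i in range(n_rows):
        acc = 0
        # Row i occupies values[row_ptr[i]:row_ptr[i + 1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx, i.e. an indexed (indirect) access.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


dense = [[4, 0, 0],
         [0, 0, 2],
         [1, 3, 0]]
values, col_idx, row_ptr = crs_from_dense(dense)
y = crs_smvm(values, col_idx, row_ptr, [1, 2, 3])
```

In this sketch the inner loop runs over a single row, so its length depends on how many non-zeros that row holds, and the reads of `x` are indirect through `col_idx`. Short rows and indexed accesses of this kind are commonly cited difficulties when mapping sparse kernels onto vector hardware.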
