Expressing Sparse Matrix Computations for Productive Performance on Spatial Architectures

This paper addresses spatial programming of sparse matrix computations for productive performance. The challenge is how to express an irregular computation and its optimizations in a regular way. A sparse matrix has (non-zero) values and a structure. In this paper, we propose to classify the implementations of a computation on a sparse matrix into two categories: (1) structure-driven, or top-down, approach, which traverses the structure with given row and column indices and locates the corresponding values, and (2) values-driven, or bottom-up, approach, which loads and processes the values in parallel streams, and decodes the structure for the values' corresponding row and column indices. On a spatial architecture like FPGAs, the values-driven approach is the norm. We show how to express a sparse matrix computation and its optimizations for a values-driven implementation. A compiler automatically synthesizes a code to decode the structure. In this way, programmers focus on optimizing the processing of the values, using familiar optimizations for dense matrices, while leaving the complex, irregular structure traversal to an automatic compiler. We also attempt to regularize the optimizations of the reduction for a dynamic number of values, which is common in a sparse matrix computation.

[1]  Hongbo Rong,et al.  Programmatic Control of a Compiler for Generating High-performance Spatial Hardware , 2017, ArXiv.

[2]  Johannes Hölzl,et al.  Specifying and verifying sparse matrix codes , 2010, ICFP '10.

[3]  Henk Corporaal,et al.  Coarse grained reconfigurable architectures in the past 25 years: Overview and classification , 2016, 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS).

[4]  Michael Garland,et al.  Implementing sparse matrix-vector multiplication on throughput-oriented processors , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[5]  Luke N. Olson,et al.  Exposing Fine-Grained Parallelism in Algebraic Multigrid Methods , 2012, SIAM J. Sci. Comput..

[6]  Joseph L. Greathouse,et al.  Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[7]  Mehmet Deveci,et al.  Multi-threaded Sparse Matrix-Matrix Multiplication for Many-Core and GPU Architectures , 2018, Parallel Comput..

[8]  Xuan Yang,et al.  Programming Heterogeneous Systems from an Image Processing DSL , 2016, ACM Trans. Archit. Code Optim..

[9]  Fred G. Gustavson,et al.  Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition , 1978, TOMS.

[10]  Frédo Durand,et al.  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.

[11]  E. Cuthill,et al.  Reducing the bandwidth of sparse symmetric matrices , 1969, ACM '69.

[12]  Hongbo Rong,et al.  Sparso: Context-driven optimizations of sparse linear algebra , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[13]  Karin Strauss,et al.  A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication , 2014, 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines.

[14]  Shoaib Kamil,et al.  The tensor algebra compiler , 2017, Proc. ACM Program. Lang..

[15]  Yao Zhang,et al.  Scan primitives for GPU computing , 2007, GH '07.

[16]  Alan George,et al.  Computer Solution of Large Sparse Positive Definite , 1981 .

[17]  Viktor K. Prasanna,et al.  Sparse Matrix-Vector multiplication on FPGAs , 2005, FPGA '05.

[18]  Kunle Olukotun,et al.  Spatial: a language and compiler for application accelerators , 2018, PLDI.