SpArch: Efficient Architecture for Sparse Matrix Multiplication

Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in many engineering and scientific applications. However, inner-product-based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer-product-based approach suffers from poor output locality due to its numerous partial product matrices. Poor reuse of either the input or the output data leads to extensive and expensive DRAM accesses. To address this problem, this paper proposes SpArch, an efficient sparse matrix multiplication accelerator architecture that jointly optimizes the data locality of both the input and the output matrices. We first design a highly parallelized, streaming-based merger that pipelines the multiply and merge stages so that partial matrices are merged on chip immediately after they are produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude, cutting DRAM access by 5.4x. We further develop a Huffman tree scheduler that improves the scalability of the merger for larger sparse matrices, reducing DRAM access by another 1.8x. Finally, we resolve the increased input matrix reads induced by the new representation with a row prefetcher that uses a near-optimal buffer replacement policy, reducing DRAM access by a further 1.5x. Evaluated on 20 benchmarks, SpArch reduces total DRAM access by 2.8x over the previous state of the art. On average, SpArch achieves 4x, 19x, 18x, 17x, and 1285x speedups and 6x, 164x, 435x, 307x, and 62x energy savings over OuterSpace, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.
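To make the outer-product flow and the Huffman-style merge scheduling concrete, below is a minimal Python sketch written against an illustrative dict-of-lists sparse format. The function names, data layout, and software heap are assumptions for exposition, not the paper's hardware design: each column k of A and the matching row k of B form one partial product matrix, and a Huffman-style scheduler always merges the two smallest partials first.

    import heapq

    def outer_product_partials(A_cols, B_rows):
        """One partial product matrix per shared index k.

        A_cols[k] = [(i, a_ik), ...]  (nonzeros of column k of A)
        B_rows[k] = [(j, b_kj), ...]  (nonzeros of row k of B)
        Each partial is a sparse dict {(i, j): value}.
        """
        partials = []
        for k, a_col in A_cols.items():
            b_row = B_rows.get(k)
            if not b_row:
                continue  # no matching row in B: this outer product is empty
            partials.append({(i, j): a * b for i, a in a_col for j, b in b_row})
        return partials

    def merge_two(p, q):
        """Merge two partial matrices, accumulating colliding coordinates."""
        out = dict(p)
        for coord, v in q.items():
            out[coord] = out.get(coord, 0.0) + v
        return out

    def huffman_merge(partials):
        """Merge the two smallest partials at every step (Huffman order).

        Smallest-first scheduling minimizes the total number of elements
        that flow through the merger, for the same reason a Huffman tree
        minimizes expected code length.
        """
        heap = [(len(p), n, p) for n, p in enumerate(partials)]
        heapq.heapify(heap)  # the counter n breaks ties; dicts are never compared
        n = len(partials)
        while len(heap) > 1:
            _, _, p = heapq.heappop(heap)
            _, _, q = heapq.heappop(heap)
            m = merge_two(p, q)
            heapq.heappush(heap, (len(m), n, m))
            n += 1
        return heap[0][2] if heap else {}

    # Tiny usage example: C = A x B for two hypothetical sparse matrices.
    A_cols = {0: [(0, 1.0), (2, 3.0)], 1: [(1, 2.0)]}
    B_rows = {0: [(0, 4.0)], 1: [(0, 5.0), (2, 6.0)]}
    C = huffman_merge(outer_product_partials(A_cols, B_rows))
    # C == {(0, 0): 4.0, (2, 0): 12.0, (1, 0): 10.0, (1, 2): 12.0}

The sketch only captures the scheduling order; in the accelerator itself the merge is pipelined with the multiply in an on-chip hardware merge tree, so partial matrices never round-trip through DRAM.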
