A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix–Matrix Multiplication Accelerator

A sparse matrix–matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures as scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected with synthesizable coalescing crossbars for efficient memory access. The 2.0-mm × 2.6-mm chip exhibits 12.6× (8.4×) energy efficiency gains, 11.7× (77.6×) off-chip bandwidth efficiency gains, and 17.1× (36.9×) compute density gains over a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
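The phase-dependent memory reconfiguration is easiest to see against the two-phase (multiply-then-merge) outer-product formulation of SpMM. The sketch below is a minimal software illustration of that formulation, not the chip's implementation: the function name, data layout, and example matrices are illustrative assumptions. Phase 1 streams rank-1 outer products (favoring scratchpad-style storage of partial products), while phase 2 merges partials per output row (favoring cache-style reuse).

```python
def spmm_outer_product(A_cols, B_rows, n_rows):
    """Multiply sparse A (stored by columns) with sparse B (stored by rows).

    A_cols[k] is a list of (i, value) pairs: the non-zeros of column k of A.
    B_rows[k] is a list of (j, value) pairs: the non-zeros of row k of B.
    Returns C as a list of {col: value} dicts, one per output row.
    """
    # Phase 1 (multiply): each index k yields the rank-1 outer product
    # A[:, k] * B[k, :]; every non-zero pair contributes one partial product.
    partials = []  # (i, j, value) partial products
    for a_col, b_row in zip(A_cols, B_rows):
        for i, a_val in a_col:
            for j, b_val in b_row:
                partials.append((i, j, a_val * b_val))

    # Phase 2 (merge): accumulate partial products that target the same
    # (i, j) output coordinate into the final sparse result.
    C = [dict() for _ in range(n_rows)]
    for i, j, v in partials:
        C[i][j] = C[i].get(j, 0.0) + v
    return C

# Example: A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]
A_cols = [[(0, 1.0)], [(1, 2.0)]]             # columns of A
B_rows = [[(1, 3.0)], [(0, 4.0)]]             # rows of B
print(spmm_outer_product(A_cols, B_rows, 2))  # [{1: 3.0}, {0: 8.0}]
```

Note that the two phases have opposite locality: the multiply phase writes each partial product exactly once, while the merge phase revisits output coordinates repeatedly, which is why a memory that can switch between scratchpad and cache modes suits this algorithm.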
