Paper information - A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix–Matrix Multiplication Accelerator

A sparse matrix–matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures as scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected with synthesizable coalescing crossbars for efficient memory access. The 2.0-mm × 2.6-mm chip exhibits 12.6× (8.4×) energy efficiency gain, 11.7× (77.6×) off-chip bandwidth efficiency gain, and 17.1× (36.9×) compute density gains against a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
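The abstract does not say which SpMM algorithm the chip runs, only that the algorithm has distinct phases. As a rough illustration, the sketch below assumes a two-phase outer-product formulation: a multiply phase that streams out partial products, followed by a merge phase that sums them. This formulation is an assumption made here for illustration, not a description of the chip. The function names and the dictionary-based storage are also hypothetical, chosen for readability rather than taken from the paper.

```python
from collections import defaultdict


def spmm_outer_product(a_cols, b_rows):
    """Compute C = A @ B for sparse A and B via outer products.

    a_cols: {k: {i: value}}, the non-zeros of column k of A
    b_rows: {k: {j: value}}, the non-zeros of row k of B
    Returns {i: {j: value}} with the non-zeros of C.
    """
    # Multiply phase: every shared index k contributes one outer product,
    # emitted as a stream of partial products (i, j, a_ik * b_kj).
    partials = []
    for k, col in a_cols.items():
        row = b_rows.get(k)
        if not row:
            continue
        for i, a in col.items():
            for j, b in row.items():
                partials.append((i, j, a * b))

    # Merge phase: partial products that land on the same (i, j) are summed.
    merged = defaultdict(dict)
    for i, j, v in partials:
        merged[i][j] = merged[i].get(j, 0.0) + v

    # Drop entries that cancelled to exactly zero, and rows left empty.
    c = {}
    for i, row in merged.items():
        kept = {j: v for j, v in row.items() if v != 0.0}
        if kept:
            c[i] = kept
    return c


def count_output_nonzeros(c):
    """Number of non-zero entries in the product matrix."""
    return sum(len(row) for row in c.values())


# A = [[1, 0], [2, 3]] stored by column, B = [[0, 4], [5, 0]] stored by row.
a_cols = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
b_rows = {0: {1: 4.0}, 1: {0: 5.0}}
c = spmm_outer_product(a_cols, b_rows)
nnz = count_output_nonzeros(c)
```

Going by the title's units, the headline metrics appear to normalise a count like `count_output_nonzeros` by the energy consumed (output non-zeros per joule) and by the off-chip traffic (output non-zeros per gigabyte). The two phases also access memory differently, streaming writes in the multiply phase and repeated lookups in the merge phase, which is one plausible reason for a phase-dependent scratchpad or cache configuration.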
