Sparse-T: Hardware accelerator thread for unstructured sparse data processing

Sparse matrix-dense vector (SpMV) multiplication is inherent in most scientific, neural network, and machine learning algorithms. To exploit the sparsity of data in SpMV computations efficiently, several compressed data representations have been used. However, compressed representations of sparse data incur the overhead of locating nonzero values, which requires indirect memory accesses that increase instruction count and memory access delays. We call this translation of compressed representations metadata processing. We propose a memory-side accelerator that performs the metadata (or indexing) computations and supplies only the required nonzero values to the processor, which also permits indexing to overlap with the core's computations on nonzero elements. In this contribution, we target low-end microcontrollers with very limited memory and processing capabilities. We explore two dedicated ASIC designs of the proposed accelerator, which handles the indexed memory accesses for the compressed sparse row (CSR) format and works alongside a simple RISC-like programmable core. One version of the accelerator supplies only the vector values corresponding to nonzero matrix values; the second version supplies both the nonzero matrix values and the matching vector values for SpMV computations. Our experiments show speedups between 1.3× and 2.1× for SpMV across different levels of sparsity. The accelerator also yields energy savings between 15.8% and 52.7% across different matrix sizes, compared with a baseline system in which the primary RISC-V core performs all computations. Our experimental evaluation uses smaller synthetic matrices with different sparsity levels and larger real-world matrices with higher sparsity (below 1% nonzeros).
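To make the metadata-processing overhead concrete, the following is a minimal software sketch in Python, not the ASIC design itself. It contrasts a baseline CSR SpMV loop, where the core resolves every index, with two generator-based models of the accelerator versions described above. All function names (`spmv_csr`, `stream_vector_values`, `stream_pairs`, and the two consumers) and the array names `row_ptr`, `col_idx`, and `values` are illustrative assumptions; the actual hardware interface is not specified here.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def spmv_csr(row_ptr: Sequence[int], col_idx: Sequence[int],
             values: Sequence[float], x: Sequence[float]) -> List[float]:
    """Baseline: the core performs all index lookups itself."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x[col_idx[k]] is the indirect access (metadata processing).
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def stream_vector_values(col_idx: Sequence[int],
                         x: Sequence[float]) -> Iterator[float]:
    """Assumed model of version 1: supplies only the matching vector values."""
    for c in col_idx:
        yield x[c]


def spmv_with_vector_stream(row_ptr: Sequence[int], values: Sequence[float],
                            xs: Iterator[float]) -> List[float]:
    """Core side for version 1: reads matrix values, never touches col_idx."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * next(xs)
        y.append(acc)
    return y


def stream_pairs(row_ptr: Sequence[int], col_idx: Sequence[int],
                 values: Sequence[float],
                 x: Sequence[float]) -> Iterator[Tuple[int, float, float]]:
    """Assumed model of version 2: supplies (row, matrix value, vector value)."""
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            yield row, values[k], x[col_idx[k]]


def spmv_from_pairs(n_rows: int,
                    pairs: Iterable[Tuple[int, float, float]]) -> List[float]:
    """Core side for version 2: only multiply-accumulate work remains."""
    y = [0.0] * n_rows
    for row, a, xv in pairs:
        y[row] += a * xv
    return y


# Example: the 3x3 matrix [[5, 0, 0], [0, 0, 3], [2, 0, 4]] in CSR form.
row_ptr = [0, 1, 2, 4]
col_idx = [0, 2, 0, 2]
values = [5.0, 3.0, 2.0, 4.0]
x = [1.0, 2.0, 3.0]

print(spmv_csr(row_ptr, col_idx, values, x))
print(spmv_with_vector_stream(row_ptr, values, stream_vector_values(col_idx, x)))
print(spmv_from_pairs(3, stream_pairs(row_ptr, col_idx, values, x)))
```

In this sketch the generators stand in for the memory-side unit: because they produce operands independently of the consuming loop, they suggest how indexing could overlap with the core's multiply-accumulate work, although Python generators run sequentially and do not model that concurrency or its timing.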
