MViD: Sparse Matrix-Vector Multiplication in Mobile DRAM for Accelerating Recurrent Neural Networks

Recurrent Neural Networks (RNNs) spend most of their execution time performing matrix-vector multiplication (MV-mul). Because the matrices in RNNs have poor reusability and have grown too large to fit in the on-chip storage of mobile/IoT devices, the performance and energy efficiency of MV-mul are determined by those of main-memory DRAM. Therefore, computing MV-mul within DRAM has drawn much attention. However, previous studies have not considered matrix sparsity, the power constraints of DRAM devices, or concurrent DRAM accesses from processors while MV-mul is being performed. We propose a main-memory architecture called MViD, which performs MV-mul by placing MAC units inside DRAM banks. For higher computational efficiency, we use a sparse matrix format and exploit quantization. Because of the limited power budget for DRAM devices, we implement the MAC units on only a portion of the DRAM banks. We architect MViD to slow down or pause MV-mul so that it can concurrently process memory requests from processors while staying within the limited power budget. Our results show that MViD provides 7.2× higher throughput than a baseline system with four DRAM ranks that performs MV-mul on a chip-multiprocessor, while running Deep Speech 2 inference alongside a memory-intensive workload.
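The paper's exact in-DRAM sparse format and quantization scheme are not spelled out in the abstract; as a minimal sketch of the kind of sparse MV-mul MViD offloads to its in-bank MAC units, the following C code computes y = A·x in a CSR (compressed sparse row) layout, skipping zero weights entirely. The format choice and the struct/function names here are illustrative assumptions, not the paper's implementation.

```c
#include <stdio.h>

/* Illustrative CSR sparse matrix-vector multiply (not MViD's actual
 * in-DRAM format): only nonzero weights are stored and multiplied. */
typedef struct {
    int rows;
    const int *row_ptr;   /* rows + 1 entries: where each row starts in vals/cols */
    const int *cols;      /* column index of each nonzero */
    const float *vals;    /* nonzero values (quantized weights in practice) */
} csr_t;

/* y = A * x; each row reduces to one multiply-accumulate chain,
 * analogous to the per-bank MAC units described in the abstract. */
void spmv_csr(const csr_t *A, const float *x, float *y) {
    for (int r = 0; r < A->rows; r++) {
        float acc = 0.0f;
        for (int i = A->row_ptr[r]; i < A->row_ptr[r + 1]; i++)
            acc += A->vals[i] * x[A->cols[i]];  /* multiply-accumulate */
        y[r] = acc;
    }
}

int main(void) {
    /* 3x3 matrix with 4 nonzeros:
     * [2 0 0]
     * [0 3 1]
     * [0 0 4] */
    const int row_ptr[] = {0, 1, 3, 4};
    const int cols[]    = {0, 1, 2, 2};
    const float vals[]  = {2, 3, 1, 4};
    csr_t A = {3, row_ptr, cols, vals};
    const float x[] = {1, 1, 1};
    float y[3];
    spmv_csr(&A, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);  /* prints: 2 4 4 */
    return 0;
}
```

Because each output element depends only on its own row's nonzeros, rows can be distributed across independent compute units (here, DRAM banks) without synchronization, which is what makes this kernel a natural fit for in-bank acceleration.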
