Puncturing the memory wall: Joint optimization of network compression with approximate memory for ASR application

Automatic speech recognition (ASR) is becoming indispensable in smart speech interaction applications. However, these applications hit the memory wall when embedded in energy- and memory-constrained Internet of Things (IoT) devices. Designing a memory- and energy-efficient ASR system is therefore imperative, yet highly challenging. This paper proposes a jointly optimized scheme that combines network compression with approximate memory to build an economical ASR system. At the algorithm level, it presents block-based pruning and quantization with an error model (BPQE), a compression framework that coordinates a novel pruning technique with low-precision quantization and the approximate memory scheme. The BPQE-compressed recurrent neural network (RNN) model achieves an ultra-high compression rate and a fine-grained structured sparsity pattern that greatly reduce the number of memory accesses. At the hardware level, it presents an ASR-adapted incremental retraining method that further improves power savings by exploiting the approximate memory scheme while preserving accuracy. Experimental results show that the proposed scheme achieves 58.6% power savings and 40× memory savings with a phone error rate of 20%.
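The abstract describes BPQE only at a high level. As an illustration, the Python sketch below shows one plausible form of block-based magnitude pruning followed by uniform low-precision quantization. The block size, sparsity target, bit width, and the function name block_prune_quantize are illustrative assumptions, not the paper's exact BPQE algorithm or its error-model component.

```python
# Minimal sketch of block-based pruning plus low-precision quantization.
# Assumption: an L1-magnitude criterion per block and symmetric uniform
# quantization; the actual BPQE framework and its error model may differ.
import numpy as np

def block_prune_quantize(weights, block=(4, 4), sparsity=0.75, bits=4):
    """Zero out the lowest-magnitude weight blocks, then uniformly quantize
    the surviving weights to `bits`-bit signed integers."""
    rows, cols = weights.shape
    br, bc = block
    assert rows % br == 0 and cols % bc == 0, "matrix must tile into blocks"

    # Score each block by the sum of absolute weights (L1 norm).
    blocks = weights.reshape(rows // br, br, cols // bc, bc)
    scores = np.abs(blocks).sum(axis=(1, 3))

    # Keep only the highest-scoring blocks: a fine-grained structured pattern.
    keep = max(int(scores.size * (1.0 - sparsity)), 1)
    threshold = np.sort(scores, axis=None)[::-1][keep - 1]
    mask = (scores >= threshold)[:, None, :, None]
    pruned = (blocks * mask).reshape(rows, cols)

    # Symmetric uniform quantization of the remaining weights.
    max_abs = np.abs(pruned).max()
    scale = max_abs / (2 ** (bits - 1) - 1) if max_abs > 0 else 1.0
    quantized = np.round(pruned / scale).astype(np.int8)
    return quantized, scale

# Example: compress one RNN weight matrix.
w = np.random.randn(128, 128).astype(np.float32)
q, s = block_prune_quantize(w)
print("fraction of nonzero weights:", np.count_nonzero(q) / q.size)
```

Because entire blocks are zeroed, the surviving weights form a regular pattern that hardware can skip wholesale, which is what enables the reduction in memory accesses claimed above.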
