4-bit Quantization of LSTM-based Speech Recognition Models

We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network Transducers (RNN-Ts). Using a 4-bit integer representation, a naïve quantization approach applied to the LSTM portion of these models results in significant Word Error Rate (WER) degradation. On the other hand, we show that minimal accuracy loss is achievable with an appropriate choice of quantizers and initializations. In particular, we customize quantization schemes depending on the local properties of the network, improving recognition performance while limiting computational time. We demonstrate our solution on the Switchboard (SWB) and CallHome (CH) test sets of the NIST Hub5-2000 evaluation. DBLSTM-HMMs trained with 300 or 2000 hours of SWB data achieve <0.5% and <1% average WER degradation, respectively. On the more challenging RNN-T models, our quantization strategy limits the WER degradation of 4-bit inference to 1.3%.
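As a concrete (if simplified) illustration of why the choice of quantizer matters, the sketch below contrasts a naïve max-scaled symmetric 4-bit quantizer, suited to zero-centered tensors such as LSTM weights or tanh outputs, with an asymmetric variant better matched to one-sided tensors such as sigmoid gate activations in [0, 1]. The function names, tensor shapes, and scale choices here are our own illustrative assumptions, not the paper's actual quantizers or initializations.

```python
import numpy as np

def quantize_symmetric(x, num_bits=4):
    """Symmetric uniform quantizer: scale from max |x|, zero-point 0.
    Suited to roughly zero-centered tensors (weights, tanh outputs).
    Returns the de-quantized ("fake-quantized") values."""
    qmax = 2 ** (num_bits - 1) - 1                     # 7 for 4-bit
    scale = np.max(np.abs(x)) / qmax + 1e-12
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # int grid [-8, 7]
    return q * scale

def quantize_asymmetric(x, num_bits=4):
    """Asymmetric uniform quantizer: full [min, max] range with an offset.
    Suited to one-sided tensors (e.g. sigmoid gate outputs in [0, 1]),
    where a symmetric grid would waste half of its levels."""
    levels = 2 ** num_bits - 1                         # 15 for 4-bit
    lo, hi = float(np.min(x)), float(np.max(x))
    scale = (hi - lo) / levels + 1e-12
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(1024, 256))            # hypothetical LSTM weights
g = 1.0 / (1.0 + np.exp(-rng.normal(size=4096)))      # hypothetical sigmoid gates

for name, x, fq in [("weights (symmetric)", W, quantize_symmetric),
                    ("gates (asymmetric)", g, quantize_asymmetric)]:
    err = np.sqrt(np.mean((x - fq(x)) ** 2))
    print(f"{name}: RMS quantization error = {err:.5f}")
```

Matching the quantizer to the local statistics of each tensor, zero-centered weights versus one-sided gate activations in this toy case, is the flavor of per-layer customization the abstract refers to, although the paper's specific schemes and initializations go beyond this baseline.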
