2-bit Conformer quantization for automatic speech recognition

Large speech models are rapidly gaining traction in the research community. As a result, model compression has become an important topic so that these models can fit in memory and be served at reduced cost. Practical approaches for compressing automatic speech recognition (ASR) models use int8 or int4 weight quantization. In this study, we propose to develop 2-bit ASR models. We explore the impact of symmetric and asymmetric quantization combined with sub-channel quantization and clipping, on both the LibriSpeech dataset and large-scale training data. We obtain a lossless 2-bit Conformer model with a 32% model size reduction compared to the state-of-the-art 4-bit Conformer model on LibriSpeech. With the large-scale training data, we obtain a 2-bit Conformer model with over 40% model size reduction relative to the 4-bit version, at the cost of a 17% relative word error rate degradation.
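
To make the terminology concrete, the sketch below contrasts symmetric and asymmetric uniform quantization at 2 bits, applied per sub-channel with outlier clipping. It is a minimal illustration of these ideas, not the paper's exact method: the function name, sub-channel count, percentile-based clipping rule, and level mapping are hypothetical choices for demonstration only.

```python
import numpy as np

def fake_quantize_2bit(w, num_sub_channels=4, asymmetric=True, clip_percentile=99.0):
    """Quantize-dequantize one weight vector to 2 bits, sub-channel by sub-channel.

    Hypothetical sketch: the sub-channel count, clipping rule, and level
    mapping are illustrative, not the paper's recipe.
    """
    w = np.asarray(w, dtype=np.float32)
    out = []
    for sub in np.array_split(w, num_sub_channels):
        # Clip outliers so the few available levels cover the bulk of the values.
        clip_val = float(np.percentile(np.abs(sub), clip_percentile))
        sub = np.clip(sub, -clip_val, clip_val)
        if asymmetric:
            # Asymmetric: 4 levels (0..3) spread over [min, max] via a zero point.
            lo, hi = float(sub.min()), float(sub.max())
            scale = (hi - lo) / 3.0 if hi > lo else 1.0
            q = np.round((sub - lo) / scale)
            out.append(q * scale + lo)
        else:
            # Symmetric: zero point fixed at 0; signed integer levels in {-2, -1, 0, 1}.
            scale = clip_val / 2.0 if clip_val > 0 else 1.0
            q = np.clip(np.round(sub / scale), -2, 1)
            out.append(q * scale)
    return np.concatenate(out)

# Example: quantize a random weight row and compare reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=512).astype(np.float32)
for asym in (False, True):
    w_q = fake_quantize_2bit(w, asymmetric=asym)
    print("asymmetric" if asym else "symmetric ", float(np.mean((w - w_q) ** 2)))
```

Splitting each channel into several sub-channels gives every group its own scale (and zero point in the asymmetric case), which helps when only four quantization levels are available; clipping trades a small error on outliers for finer resolution on the remaining values.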
