Whisper-KDQ: A Lightweight Whisper via Guided Knowledge Distillation and Quantization for Efficient ASR

With the rapid development of computing hardware and the dramatic growth of available data, pre-trained speech models such as Whisper have substantially improved the performance of speech recognition tasks. However, these models usually carry a high computational overhead, which makes them difficult to deploy on resource-constrained devices. To speed up inference and reduce model size while preserving performance, we propose a novel guided knowledge distillation and quantization approach for the large pre-trained Whisper model. The student model selects the layers to distill and the layers to quantize based on the quantization loss and the distillation loss, respectively. We compress $\text{Whisper}_\text{small}$ to the $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$ levels, making $\text{Whisper}_\text{small}$ 5.18x and 10.48x smaller, respectively. Moreover, compared with the original $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$, the compressed models achieve relative character error rate (CER) reductions of 11.3% and 14.0%, respectively.
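To make the loss-guided layer selection described above concrete, the sketch below ranks layers by a per-layer distillation loss and a per-layer quantization loss and picks layers from those rankings. It is a minimal illustration only: the function names, the MSE-based loss definitions, the symmetric uniform quantizer, the number of selected layers, and the preference for low-loss layers are all assumptions, not the paper's actual implementation.

```python
# Minimal sketch of loss-guided layer selection for distillation and quantization.
# Everything here (loss definitions, quantizer, selection rule) is a hypothetical
# illustration, not the Whisper-KDQ implementation.
import torch
import torch.nn.functional as F


def per_layer_distillation_loss(teacher_hidden, student_hidden):
    # MSE between matched teacher/student hidden states, one value per layer.
    return [F.mse_loss(s, t).item() for t, s in zip(teacher_hidden, student_hidden)]


def per_layer_quantization_loss(layer_weights, num_bits=8):
    # Reconstruction error of symmetric uniform quantization of each layer's weights.
    losses = []
    qmax = 2 ** (num_bits - 1) - 1
    for w in layer_weights:
        scale = w.abs().max() / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        losses.append(F.mse_loss(w_q, w).item())
    return losses


def select_layers(losses, k):
    # Pick the k layers with the smallest loss; whether low- or high-loss layers
    # should be preferred is itself an assumption in this sketch.
    return sorted(range(len(losses)), key=lambda i: losses[i])[:k]


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy stand-ins for 12 teacher layers and their student counterparts.
    teacher_h = [torch.randn(4, 64) for _ in range(12)]
    student_h = [h + 0.1 * torch.randn_like(h) for h in teacher_h]
    weights = [torch.randn(64, 64) for _ in range(12)]

    kd_loss = per_layer_distillation_loss(teacher_h, student_h)
    q_loss = per_layer_quantization_loss(weights, num_bits=4)

    # Following the abstract's cross-guidance: distillation layers are chosen
    # from the quantization-loss ranking, and quantization layers from the
    # distillation-loss ranking.
    print("layers selected for distillation:", select_layers(q_loss, k=6))
    print("layers selected for quantization:", select_layers(kd_loss, k=6))
```

The demo follows the abstract's cross-guidance pairing (quantization loss guides which layers are distilled, distillation loss guides which layers are quantized); in practice the per-layer losses would be measured on the actual Whisper teacher and student rather than on random tensors.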
