Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation

Knowledge distillation (KD) is a common approach to improving model performance in automatic speech recognition (ASR), where a student model is trained to imitate the output behaviour of a teacher model. However, traditional KD methods suffer from a teacher-label storage problem, especially when the training corpora are large. Although on-the-fly teacher label generation avoids this storage cost, training becomes significantly slower because the teacher model must be evaluated on every batch. In this paper, we reformulate teacher label generation as a codec problem. We propose a novel Multi-codebook Vector Quantization (MVQ) approach that compresses teacher embeddings into codebook indexes (CI). Based on this, we propose a KD training framework (MVQ-KD) in which a student model predicts the CI generated from the embeddings of a self-supervised pre-trained teacher model. Experiments on the LibriSpeech clean-100 hour subset show that the MVQ-KD framework achieves performance comparable to traditional KD methods (l1, l2) while requiring 256 times less storage. When the full LibriSpeech dataset is used, MVQ-KD yields relative word error rate reductions (WERRs) of 13.8% and 8.2% for the non-streaming transducer on test-clean and test-other, respectively, and 4.0% and 4.9% for the streaming transducer. The implementation of this work has been released as part of the open-source project icefall.
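The abstract describes the MVQ codec only at a high level. The sketch below is a minimal NumPy illustration of a residual-style multi-codebook quantizer, intended to make the idea of compressing a teacher embedding into a handful of codebook indexes concrete. The function names (mvq_encode, mvq_decode), the untrained random codebooks, and the sizes (16 codebooks of 256 codewords over a 1024-dimensional embedding) are illustrative assumptions, not the paper's actual trained quantizer, which is released as part of icefall.

```python
import numpy as np

# Residual-style encoder: each codebook quantizes what the previous ones missed.
def mvq_encode(embedding, codebooks):
    residual = embedding.astype(np.float32).copy()
    indexes = []
    for codebook in codebooks:                      # codebook: (num_codewords, dim)
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indexes.append(idx)
        residual -= codebook[idx]
    return indexes

# Decoder: the approximate embedding is the sum of the selected codewords.
def mvq_decode(indexes, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, indexes))

# Toy usage with illustrative sizes (not necessarily those used in the paper):
# 16 codebooks of 256 codewords each map a 1024-dim float32 embedding
# (4096 bytes) to 16 one-byte indexes, i.e. a 256x storage reduction,
# ignoring the fixed cost of the shared codebooks.
rng = np.random.default_rng(0)
dim, num_codebooks, num_codewords = 1024, 16, 256
codebooks = [rng.standard_normal((num_codewords, dim)).astype(np.float32)
             for _ in range(num_codebooks)]
teacher_embedding = rng.standard_normal(dim).astype(np.float32)
ci = mvq_encode(teacher_embedding, codebooks)       # 16 indexes, each in [0, 256)
reconstruction = mvq_decode(ci, codebooks)
print(ci)
```

In the MVQ-KD framework described above, only the integer indexes need to be stored for each frame of the training corpus; the student is then trained to predict those indexes (presumably with one classifier per codebook) rather than to regress the raw teacher embeddings.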
