Iterative Pseudo-Labeling for Speech Recognition

Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data. We study the main components of IPL: decoding with a language model and data augmentation. We then demonstrate the effectiveness of IPL by achieving state-of-the-art word error rate on the LibriSpeech test sets in both the standard and low-resource settings. We also study the effect of language models trained on different corpora to show that IPL can effectively utilize additional text. Finally, we release a new large in-domain text corpus which does not overlap with the LibriSpeech training transcriptions, to foster research in low-resource, semi-supervised ASR.
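
The abstract describes IPL as a loop that alternates between pseudo-labeling a subset of the unlabeled audio (via beam-search decoding with a language model) and fine-tuning the existing acoustic model on labeled plus pseudo-labeled data. The sketch below illustrates that loop only at a high level and under stated assumptions: all names (decode_with_lm, fine_tune, subset_fraction) are hypothetical placeholders supplied by the caller, not part of any released codebase, and details such as filtering of pseudo-labels or the exact augmentation policy are omitted.

```python
import random
from typing import Callable, List, Sequence, Tuple

def iterative_pseudo_labeling(
    model: object,
    labeled: List[Tuple[object, str]],            # (audio, transcript) pairs
    unlabeled: Sequence[object],                   # raw audio only
    decode_with_lm: Callable[[object, object], str],  # beam-search decoding with an external LM
    fine_tune: Callable[[object, List[Tuple[object, str]]], object],
    num_iterations: int = 5,
    subset_fraction: float = 0.5,
) -> object:
    """Minimal IPL sketch: repeatedly pseudo-label a random subset of the
    unlabeled audio with the current model plus a language model, then
    fine-tune the same model on labeled + pseudo-labeled utterances."""
    for _ in range(num_iterations):
        # 1) Pseudo-label a random subset of the unlabeled audio with the
        #    current acoustic model and LM-based beam-search decoding.
        subset = random.sample(list(unlabeled), int(subset_fraction * len(unlabeled)))
        pseudo_labeled = [(audio, decode_with_lm(model, audio)) for audio in subset]

        # 2) Continue training the existing model (no retraining from scratch)
        #    on the union of ground-truth and pseudo-labeled data. Data
        #    augmentation (e.g. SpecAugment) is assumed to be applied inside
        #    the caller-provided fine_tune routine.
        model = fine_tune(model, labeled + pseudo_labeled)
    return model
```

In this reading, using a random subset of the unlabeled data at each iteration and fine-tuning rather than retraining are what keep the repeated pseudo-labeling rounds computationally affordable.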
