BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude in dataset size, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representations of pre-trained networks to achieve SoTA results on non-ASR tasks.