Scaling ASR Improves Zero and Few Shot Learning

With 4.5 million hours of English speech drawn from 10 different sources across 120 countries, and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques that efficiently scale training data by finding the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage optimizations such as a sparse transducer loss and model sharding. By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains. Our models also learn powerful speech representations with zero- and few-shot capabilities on novel domains and styles of speech, exceeding previous results across multiple in-house and public benchmarks. For speakers with aphasia, a language disorder caused by brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively, while also achieving the best performance on public social media videos. Moreover, the same universal model reaches equivalent performance with 500x less in-domain data on the SPGISpeech financial-domain dataset.
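
The abstract names data selection as the means of scaling training data but gives no detail here. As a purely illustrative sketch (the Utterance fields, the reference-model loss_fn, and the loss-band criterion are assumptions, not the paper's stated method), one common way to surface valuable samples is to score each candidate with a smaller reference model and keep only utterances whose loss falls in an informative band:

```python
# Illustrative loss-based data selection (a hedged sketch, not the paper's exact method).
# Each candidate utterance is scored with a reference model's loss; utterances with
# very low loss (redundant) or very high loss (likely mislabeled) are dropped.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    audio_path: str
    transcript: str
    score: float = 0.0  # filled in by score_utterances()


def score_utterances(utts: List[Utterance],
                     loss_fn: Callable[[str, str], float]) -> None:
    """Attach a per-utterance loss from a (hypothetical) reference ASR model."""
    for utt in utts:
        utt.score = loss_fn(utt.audio_path, utt.transcript)


def select(utts: List[Utterance], low: float, high: float) -> List[Utterance]:
    """Keep utterances whose loss lies in (low, high): informative but not noisy."""
    return [u for u in utts if low < u.score < high]


if __name__ == "__main__":
    # Dummy scorer stands in for a real reference model.
    data = [Utterance("a.wav", "hello world"), Utterance("b.wav", "good morning")]
    score_utterances(data, loss_fn=lambda path, text: float(len(text)) / 10.0)
    kept = select(data, low=0.5, high=5.0)
    print([u.audio_path for u in kept])
```

Under this reading, very low losses mark redundant, easy samples and very high losses often mark noisy transcripts; the thresholds and scoring model used in the actual work may differ.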
