Scaling Speech Technology to 1,000+ Languages

Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and the effective use of self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
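The headline comparison above is stated in terms of word error rate (WER), the standard ASR metric: the word-level edit distance between a model's transcript and the reference, normalized by the number of reference words. A minimal sketch of the computation (the function name and tokenization by whitespace are illustrative assumptions, not the paper's evaluation code, which may apply additional text normalization):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Under this metric, "more than halves the word error rate" means the MMS model makes fewer than half as many word-level errors per reference word as Whisper on the 54 overlapping FLEURS languages.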
