Unsupervised Cross-lingual Representation Learning for Speech Recognition

This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on a concurrently introduced self-supervised model which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to the strongest comparable system. Our approach enables a single multilingual speech recognition model which is competitive to strong individual models. Analysis shows that the latent discrete speech representations are shared across languages with increased sharing for related languages.

[1]  Mark J. F. Gales,et al.  Stimulated training for automatic speech recognition and keyword search in limited resource conditions , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Vishrav Chaudhary,et al.  CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data , 2019, LREC.

[3]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[4]  Frantisek Grézl,et al.  Multilingually trained bottleneck features in spoken language recognition , 2017, Comput. Speech Lang..

[5]  Ankur Bapna,et al.  Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges , 2019, ArXiv.

[6]  Myle Ott,et al.  fairseq: A Fast, Extensible Toolkit for Sequence Modeling , 2019, NAACL.

[7]  Gabriel Synnaeve,et al.  MLS: A Large-Scale Multilingual Dataset for Speech Research , 2020, INTERSPEECH.

[8]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[9]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[10]  James Glass,et al.  Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech , 2020, ICLR.

[11]  Laurent Besacier,et al.  Automatic Speech Recognition for Under-Resourced Languages: Application to Vietnamese Language , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[12]  Yifan Gong,et al.  Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13]  Veselin Stoyanov,et al.  Unsupervised Cross-lingual Representation Learning at Scale , 2019, ACL.

[14]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[15]  Luke S. Zettlemoyer,et al.  Transformers with convolutional context for ASR , 2019, ArXiv.

[16]  Haizhou Li,et al.  VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019 , 2019, INTERSPEECH.

[17]  Hao Tang,et al.  An Unsupervised Autoregressive Model for Speech Representation Learning , 2019, INTERSPEECH.

[18]  Cordelia Schmid,et al.  Product Quantization for Nearest Neighbor Search , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[20]  William J. Byrne,et al.  Towards language independent acoustic modeling , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[21]  James R. Glass,et al.  Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech , 2018, INTERSPEECH.

[22]  Richard M. Schwartz,et al.  The 2016 BBN Georgian telephone speech keyword spotting system , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Alexei Baevski,et al.  Effectiveness of self-supervised pre-training for speech recognition , 2019, ArXiv.

[24]  Kevin Gimpel,et al.  Gaussian Error Linear Units (GELUs) , 2016 .

[25]  Xiangang Li,et al.  Improving Transformer-based Speech Recognition Using Unsupervised Pre-training , 2019, ArXiv.

[26]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[27]  Ronan Collobert,et al.  wav2vec: Unsupervised Pre-training for Speech Recognition , 2019, INTERSPEECH.

[28]  Steve Renals,et al.  Multilingual training of deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[29]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[30]  Francis M. Tyers,et al.  Common Voice: A Massively-Multilingual Speech Corpus , 2020, LREC.

[31]  Georg Heigold,et al.  Multilingual acoustic models using distributed deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[32]  Alexei Baevski,et al.  vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations , 2019, ICLR.

[33]  Armand Joulin,et al.  Unsupervised Pretraining Transfers Well Across Languages , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Wonkyum Lee,et al.  Semi-supervised training in low-resource ASR and KWS , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Ewald van der Westhuizen,et al.  Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks , 2019, INTERSPEECH.

[36]  Mark J. F. Gales,et al.  Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED , 2014, SLTU.

[37]  Guillaume Lample,et al.  Cross-lingual Language Model Pretraining , 2019, NeurIPS.

[38]  Tatsuya Kawahara,et al.  Transfer Learning of Language-independent End-to-end ASR with Language Model Fusion , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Mark J. F. Gales,et al.  Confidence Estimation and Deletion Prediction Using Bidirectional Recurrent Neural Networks , 2018 .

[40]  Philipp Koehn,et al.  Scalable Modified Kneser-Ney Language Model Estimation , 2013, ACL.

[41]  Philip N. Garner,et al.  Current trends in multilingual speech processing , 2011 .

[42]  Alexei Baevski,et al.  wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , 2020, NeurIPS.

[43]  Kai Feng,et al.  Multilingual acoustic modeling for speech recognition based on subspace Gaussian Mixture Models , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[44]  Tara N. Sainath,et al.  Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model , 2019, INTERSPEECH.

[45]  Karen Simonyan,et al.  The challenge of realistic music generation: modelling raw audio at scale , 2018, NeurIPS.

[46]  Shinji Watanabe,et al.  Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[47]  Tara N. Sainath,et al.  Multilingual Speech Recognition with a Single End-to-End Model , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[48]  Mark J. F. Gales,et al.  Low-Resource Speech Recognition and Keyword-Spotting , 2017, SPECOM.

[49]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Jonathan Le Roux,et al.  An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[51]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.