Paper information - Unsupervised Cross-lingual Representation Learning for Speech Recognition

This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model on the raw waveform of speech in multiple languages. We build on a concurrently introduced self-supervised model that is trained by solving a contrastive task over masked latent speech representations and that jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to the strongest comparable system. Our approach enables a single multilingual speech recognition model that is competitive with strong individual models. Analysis shows that the latent discrete speech representations are shared across languages, with increased sharing for related languages.
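The contrastive task mentioned in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style objective with cosine similarity and a temperature, which is a common form for this kind of task; the abstract does not give the formula, so the function names, argument names, and the temperature value below are hypothetical and not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss for a single masked time step (illustrative only).

    context:     model output at the masked position (hypothetical name)
    target:      quantized latent for that position, the positive example
    distractors: quantized latents from other positions, the negatives
    temperature: assumed scaling constant, not taken from the source
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    # Negative log-probability of selecting the true target (index 0).
    return log_denominator - logits[0]


if __name__ == "__main__":
    context = [1.0, 0.0]
    target = [0.9, 0.1]
    distractors = [[0.0, 1.0], [-1.0, 0.0]]
    print(contrastive_loss(context, target, distractors))
```

The loss is small when the context vector is closer to the true quantized latent than to any distractor, and it grows when a distractor matches better. In the cross-lingual setting described above, the quantized latents would be drawn from a codebook shared across languages.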

Ronan Collobert | Alexei Baevski | Michael Auli | Abdelrahman Mohamed | Alexis Conneau
