Joint Unsupervised and Supervised Training for Multilingual ASR

Self-supervised training has shown promising gains in pretraining models and facilitating downstream finetuning for speech recognition tasks such as multilingual ASR. Most existing methods adopt a two-stage scheme in which a self-supervised loss is optimized in a first pretraining stage and standard supervised finetuning follows in a second stage. In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method that combines the supervised RNN-T loss with the self-supervised contrastive and masked language modeling (MLM) losses. We validate its performance on the public Multilingual LibriSpeech (MLS) dataset, which covers 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST consistently outperforms other existing state-of-the-art methods and beats the monolingual baselines by a significant margin, demonstrating its capability of handling low-resource languages in multilingual ASR. Averaged over all languages, JUST improves WER by 33.3% over the average monolingual baseline and by 32% over the state-of-the-art two-stage XLSR. On low-resource languages such as Polish, our WER is less than half of the monolingual baseline and even beats a supervised transfer learning method that uses external supervision.
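
As a rough illustration of the joint objective described above, the sketch below combines the supervised RNN-T loss with the self-supervised contrastive and MLM losses as a weighted sum. This is a minimal sketch under stated assumptions, not the authors' implementation: the linear weighting scheme, the weight values, and all function and variable names are illustrative.

```python
# Minimal sketch of a JUST-style joint training objective (illustrative only;
# not the paper's code). Assumes the three per-batch loss terms have already
# been computed as scalar tensors elsewhere in the training step.
import torch


def just_loss(rnnt_loss: torch.Tensor,
              contrastive_loss: torch.Tensor,
              mlm_loss: torch.Tensor,
              w_contrastive: float = 0.1,
              w_mlm: float = 0.1) -> torch.Tensor:
    """Combine the supervised RNN-T loss with the self-supervised
    contrastive and MLM losses into a single end-to-end objective.
    The weights are hypothetical hyperparameters, not values from the paper."""
    return rnnt_loss + w_contrastive * contrastive_loss + w_mlm * mlm_loss


# Example usage inside a training step (loss tensors are placeholders):
loss = just_loss(torch.tensor(2.3), torch.tensor(1.1), torch.tensor(0.8))
```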
