Large-Scale Self- and Semi-Supervised Learning for Speech Translation

In this paper, we improve speech translation (ST) by effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training using the large Libri-Light speech audio corpus, as well as language modeling with CommonCrawl text. Our experiments improve over the previous state of the art by 2.6 BLEU on average across all four considered CoVoST 2 language pairs via a simple recipe: wav2vec 2.0 pretraining, a single iteration of self-training, and decoding with a language model. Unlike existing work, our approach does not leverage any supervision other than ST data. Code and models will be publicly released.
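As a hedged illustration of the two semi-supervised ingredients in the recipe above, the sketch below shows (1) a single round of pseudo-labeling unlabeled audio with an initial ST model, and (2) a shallow-fusion-style combination of ST and language-model scores when ranking decoding hypotheses. The helper names (`translate`, `fused_score`) and the weight values are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of one self-training round and
# LM-fused hypothesis scoring. `translate` stands in for whatever initial
# ST model produces translations for unlabeled audio; weights are examples.

from typing import Callable, Iterable, List, Tuple


def make_pseudo_labels(
    unlabeled_audio: Iterable[str],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Label unlabeled speech with the current ST model's translations.

    The resulting (audio, pseudo-translation) pairs can be mixed with the
    labeled ST data for one further round of fine-tuning.
    """
    return [(audio, translate(audio)) for audio in unlabeled_audio]


def fused_score(
    st_logprob: float,
    lm_logprob: float,
    length: int,
    lm_weight: float = 0.5,
    length_bonus: float = 1.0,
) -> float:
    """Shallow-fusion-style hypothesis score: ST log-prob plus a weighted
    LM log-prob plus a length reward to counteract short-hypothesis bias."""
    return st_logprob + lm_weight * lm_logprob + length_bonus * length
```

In the recipe described above, the pseudo-labeled pairs feed a single additional fine-tuning pass, and the LM term is supplied by a model trained on CommonCrawl text; the specific weighting scheme here is an assumption for illustration only.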
