NVIDIA NeMo Offline Speech Translation Systems for IWSLT 2023

This paper provides an overview of NVIDIA NeMo’s speech translation systems for the IWSLT 2023 Offline Speech Translation Task. This year, we focused on an end-to-end system that capitalizes on pre-trained models and synthetic data to mitigate the scarcity of direct speech translation data. When trained on IWSLT 2022 constrained data, our best En->De end-to-end model achieves an average score of 31 BLEU on 7 test sets from IWSLT 2010-2020, improving over last year's cascade (28.4) and end-to-end (25.7) submissions. When trained on IWSLT 2023 constrained data, the average score drops to 29.5 BLEU.
