One-to-Many Multilingual End-to-End Speech Translation

Nowadays, training end-to-end neural models for spoken language translation (SLT) still has to confront extreme data scarcity. The existing SLT parallel corpora are indeed orders of magnitude smaller than those available for the closely related tasks of automatic speech recognition (ASR) and machine translation (MT), which usually comprise tens of millions of instances. To cope with this data paucity, in this paper we explore the effectiveness of transfer learning in end-to-end SLT by presenting a multilingual approach to the task. Multilingual solutions are widely studied in MT and usually rely on “target forcing”, in which multilingual parallel data are combined to train a single model by prepending to the input sequences a language token that specifies the target language. However, when tested in speech translation, our experiments show that MT-like target forcing, used as is, is not effective in discriminating among the target languages. Thus, we propose a variant that uses target-language embeddings to shift the input representations into different portions of the space according to the language, so as to better support the production of output in the desired target language. Our experiments on end-to-end SLT from English into six languages show important improvements when translating into similar languages, especially when these are supported by scarce data. Further improvements are obtained when using English ASR data as an additional language (up to +2.5 BLEU points).
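
The following is a minimal sketch, not the authors' implementation, of the language-embedding variant of target forcing described in the abstract: instead of prepending a language token to the input sequence, a learned target-language embedding is added to every frame of the speech encoder input, shifting the representation toward a language-specific region of the space. The module name, feature dimension, and number of languages are illustrative assumptions.

```python
# Hypothetical sketch of target-language embedding shifting (PyTorch).
import torch
import torch.nn as nn


class LanguageShiftedEncoderInput(nn.Module):
    """Adds a learned target-language embedding to every time step of the
    encoder input features (illustrative layer, not the paper's code)."""

    def __init__(self, feature_dim: int = 512, num_languages: int = 6):
        super().__init__()
        # One embedding vector per target language, same size as the features.
        self.lang_embedding = nn.Embedding(num_languages, feature_dim)

    def forward(self, features: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim) speech representations
        # lang_id:  (batch,) index of the desired target language
        shift = self.lang_embedding(lang_id)   # (batch, feature_dim)
        return features + shift.unsqueeze(1)   # broadcast the shift over time


if __name__ == "__main__":
    layer = LanguageShiftedEncoderInput(feature_dim=512, num_languages=6)
    feats = torch.randn(2, 100, 512)           # two utterances, 100 frames each
    langs = torch.tensor([0, 3])               # indices of two target languages
    print(layer(feats, langs).shape)           # torch.Size([2, 100, 512])
```

Compared with prepending a single language token, adding the embedding to every frame keeps the language signal present throughout the (typically long) audio sequence, which is the intuition behind shifting the input representations per language.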
