End-to-End Speech Translation With Transcoding by Multi-Task Learning for Distant Language Pairs

Directly translating spoken utterances from a source language to a target language is challenging because it requires a fundamental transformation of both linguistic and para/non-linguistic features. Traditional speech-to-speech translation approaches cascade automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech synthesis (TTS), passing text between the components. The current state-of-the-art models for ASR, MT, and TTS are built mainly with deep neural networks, in particular encoder-decoder networks with an attention mechanism. Recently, several works have constructed end-to-end direct speech-to-text translation systems by combining ASR and MT into a single model. However, the usefulness of such models has only been investigated on language pairs with similar syntax and word order (e.g., English-French or English-Spanish). For syntactically distant language pairs (e.g., English-Japanese), speech translation requires long-distance word reordering. Furthermore, parallel texts with corresponding speech utterances suitable for training end-to-end speech translation are generally unavailable, and collecting such corpora is time-consuming and expensive. This article presents the first attempt to build an end-to-end direct speech-to-text translation system for syntactically distant language pairs that suffer from long-distance reordering. We train the model on English (subject-verb-object, SVO, word order) and Japanese (SOV word order) language pairs. To guide the attention-based encoder-decoder model through this difficult problem, we construct end-to-end speech translation with transcoding, as sketched below, and employ curriculum learning (CL) strategies that gradually train the network for the end-to-end speech translation task by adapting the decoder or encoder parts. We also use TTS for data augmentation, generating corresponding speech utterances from existing parallel text data. Our experimental results show that the proposed approach yields significant improvements over conventional cascade models and over a direct speech translation approach that uses a single model without transcoding and CL strategies.
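The transcoding idea, decoding a source-language transcript first and letting the translation decoder attend to those intermediate states rather than to raw speech alone, can be made concrete with a short sketch. The following PyTorch code is a minimal illustration under assumed dimensions, not the paper's implementation; all module names (SpeechTranscoder, asr_decoder, mt_decoder) and hyperparameters are hypothetical.

```python
# Minimal sketch of a multi-task speech-translation transcoder, assuming
# log-Mel filterbank inputs. Names and sizes are illustrative only.
import torch
import torch.nn as nn

class SpeechTranscoder(nn.Module):
    """Attention-based encoder-decoder with an intermediate transcoding pass:
    speech encoder -> source-text (ASR) decoder -> target-text (MT) decoder."""
    def __init__(self, n_mels=80, hidden=256, src_vocab=8000, tgt_vocab=8000):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.asr_decoder = nn.LSTM(hidden, 2 * hidden, batch_first=True)
        self.asr_out = nn.Linear(2 * hidden, src_vocab)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden)
        self.mt_decoder = nn.LSTM(hidden, 2 * hidden, batch_first=True)
        self.mt_out = nn.Linear(2 * hidden, tgt_vocab)

    def attend(self, queries, keys):
        # Dot-product (Luong-style) attention over a sequence of states.
        scores = torch.bmm(queries, keys.transpose(1, 2))   # (B, Q, K)
        weights = torch.softmax(scores, dim=-1)
        return torch.bmm(weights, keys)                     # (B, Q, 2*hidden)

    def forward(self, speech, src_tokens, tgt_tokens):
        enc, _ = self.encoder(speech)                       # (B, T, 2*hidden)
        # First pass: decode the source-language transcript (ASR sub-task).
        asr_h, _ = self.asr_decoder(self.src_embed(src_tokens))
        asr_ctx = self.attend(asr_h, enc)
        asr_logits = self.asr_out(asr_h + asr_ctx)
        # Second pass ("transcoding"): the MT decoder attends to the ASR
        # decoder's hidden states instead of the raw speech encoding.
        mt_h, _ = self.mt_decoder(self.tgt_embed(tgt_tokens))
        mt_ctx = self.attend(mt_h, asr_h + asr_ctx)
        mt_logits = self.mt_out(mt_h + mt_ctx)
        return asr_logits, mt_logits
```

In training, the ASR and MT cross-entropy losses would be combined into a single multi-task objective, and a curriculum schedule along the lines described above could first optimize the encoder on the ASR sub-task and the decoder on the MT sub-task before joint end-to-end fine-tuning.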
