Toward Expressive Speech Translation: A Unified Sequence-to-Sequence LSTMs Approach for Translating Words and Emphasis

Emphasis is an important piece of paralinguistic information used to express intentions and attitudes or to convey emotion. Recent works have tried to translate emphasis by adding separate emphasis estimation and translation components to an existing speech-to-speech translation (S2ST) system. Although these approaches can preserve emphasis, they add complexity to the translation pipeline: the emphasis translation component has to wait for the target-language sentence and the word alignments derived from the machine translation system, resulting in a significant translation delay. In this paper, we propose an approach that jointly trains and predicts words and emphasis in a unified architecture based on sequence-to-sequence models. The proposed model not only speeds up the translation pipeline but also allows joint training. Our experiments on the emphasis and word translation tasks show that we achieve performance comparable to previous approaches on both tasks while eliminating their complex dependencies.
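To make the unified architecture concrete, the sketch below shows one plausible reading of the abstract: a single encoder-decoder LSTM whose decoder state feeds two output heads, one predicting the next target word and one predicting that word's emphasis level, trained with a joint loss. All class names, layer sizes, and the binary emphasis labeling are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch (PyTorch) of a sequence-to-sequence LSTM that jointly
# predicts target words and per-word emphasis labels. Hyperparameters and
# the two-level emphasis scheme are assumptions for illustration only.
import torch
import torch.nn as nn

class JointWordEmphasisSeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512,
                 n_emph_levels=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        # Two heads share one decoder state: one scores the next target
        # word, the other scores the emphasis level of that word.
        self.word_head = nn.Linear(hid_dim, tgt_vocab)
        self.emph_head = nn.Linear(hid_dim, n_emph_levels)

    def forward(self, src_ids, tgt_in_ids):
        _, state = self.encoder(self.src_emb(src_ids))
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in_ids), state)
        return self.word_head(dec_out), self.emph_head(dec_out)

# Joint training: sum the word and emphasis cross-entropy losses so both
# tasks are optimized together in one pass, as the abstract describes.
model = JointWordEmphasisSeq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (4, 12))      # toy source word IDs
tgt_in = torch.randint(0, 8000, (4, 10))   # shifted target word IDs
tgt_out = torch.randint(0, 8000, (4, 10))  # gold target word IDs
emph = torch.randint(0, 2, (4, 10))        # gold emphasis labels (0/1)

word_logits, emph_logits = model(src, tgt_in)
ce = nn.CrossEntropyLoss()
loss = ce(word_logits.transpose(1, 2), tgt_out) + \
       ce(emph_logits.transpose(1, 2), emph)
loss.backward()
```

Because both heads read the same decoder state, the word and its emphasis are emitted in the same decoding step, which is what removes the wait for a separate alignment stage in the pipeline.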
