Sequence-to-Sequence Models for Emphasis Speech Translation

Speech-to-speech translation (S2ST) systems break language barriers in cross-lingual communication by translating speech across languages. Recent studies have extended S2ST systems to handle not only linguistic meaning but also paralinguistic information such as emphasis, by adding separate emphasis estimation and emphasis translation components. However, the approach used for emphasis translation is not well suited to sequence translation tasks and cannot easily capture long-term dependencies between words and emphasis levels. It also requires quantizing emphasis levels into discrete labels rather than treating them as continuous values. Moreover, the overall translation pipeline is complex and slow because its components are trained separately without joint optimization. In this paper, we make two contributions: 1) we propose an approach based on sequence-to-sequence models that handles continuous emphasis levels, and 2) we combine machine translation and emphasis translation into a single model, which greatly simplifies the translation pipeline and makes joint optimization easier. Our results on an emphasis translation task indicate that our translation models outperform previous models by a large margin in both objective and subjective tests. Experiments on a joint translation model also show that our models can translate words and emphasis jointly with one-word delays instead of full-sentence delays while preserving the translation performance of both tasks.
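The core idea of the single joint model can be illustrated with a minimal sketch: at each decoding step, one shared decoder state feeds two output heads, a softmax over the target vocabulary for the word and a sigmoid producing a continuous emphasis level in (0, 1). This is not the authors' implementation; the vocabulary size, hidden size, and random weights below are purely illustrative assumptions.

```python
import numpy as np

# Toy joint decoder step (illustrative only, not the paper's code):
# one hidden state -> (discrete word distribution, continuous emphasis level).
rng = np.random.default_rng(0)
VOCAB, HIDDEN = 8, 16                                   # assumed toy sizes

W_word = rng.normal(scale=0.1, size=(VOCAB, HIDDEN))    # word projection head
w_emph = rng.normal(scale=0.1, size=HIDDEN)             # emphasis regression head

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def joint_step(h):
    """One decoding step: predict the next word and its emphasis level."""
    word_probs = softmax(W_word @ h)                    # discrete word output
    emphasis = 1.0 / (1.0 + np.exp(-(w_emph @ h)))      # continuous value in (0, 1)
    return word_probs, emphasis

h = rng.normal(size=HIDDEN)                             # stand-in decoder state
probs, emph = joint_step(h)
print(probs.argmax(), float(emph))
```

Because both heads read the same decoder state, the word and its emphasis level can be emitted together at each step, which is what permits one-word rather than full-sentence delays, and the emphasis head regresses a continuous value so no quantization into discrete labels is needed.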
