Towards simultaneous interpreting: the timing of incremental machine translation and speech synthesis

In simultaneous interpreting, human experts incrementally construct and extend partial hypotheses about the source speaker’s message, and start to verbalize a corresponding message in the target language, based on a partial translation – which may have to be corrected occasionally. They commence the target utterance in the hope that they will be able to finish understanding the source speaker’s message and determine its translation in time for the unfolding delivery. Of course, both incremental understanding and translation by humans can be garden-pathed, although experts are able to optimize their delivery so as to balance the goals of minimal latency, translation quality and high speech fluency with few corrections. We investigate the temporal properties of both translation input and output to evaluate the tradeoff between low latency and translation quality. In addition, we estimate the improvements that can be gained with a tempo-elastic speech synthesizer.
