Constructing a speech translation system using simultaneous interpretation data

A fair amount of work has addressed automatic speech translation systems that translate in real time, serving as a computerized version of a simultaneous interpreter. Translation studies have observed that simultaneous interpreters use a number of techniques to make content easier to understand in real time, such as dividing their translations into small chunks or summarizing less important content. However, most previous work has not specifically taken this into account, and simply trains the machine translation system on translation data (made by translators). In this paper, we examine the possibility of additionally incorporating simultaneous interpretation data (made by simultaneous interpreters) in the learning process. First, we collect simultaneous interpretation data from professional simultaneous interpreters at three levels of experience and analyze the data. Next, we incorporate the simultaneous interpretation data in the training of the machine translation system. As a result, the translation style of the system becomes more similar to that of a highly experienced simultaneous interpreter. We also find that, according to automatic evaluation metrics, our system achieves performance similar to that of a simultaneous interpreter with one year of experience.
