Complexity of spoken versus written language for machine translation

When machine translation researchers participate in evaluation tasks, they typically build their primary submissions on ideas that are not genre-specific, so their systems look much the same from one evaluation campaign to the next. In this paper, we analyze two popular genres, spoken language and written news, using publicly available corpora from the WMT and IWSLT evaluation campaigns. We show that the two genres differ enough that each task calls for its own statistical modeling strategies. We identify translation problems that are unique to each task and bring them to researchers' attention so that they can focus their efforts on the task at hand.
