Paper information - Incorporating pronoun function into statistical machine translation

Incorporating pronoun function into statistical machine translation

Pronouns are used frequently in language and perform a range of functions. Some pronouns are used to express coreference, and others are not. Languages and genres differ in how and when they use pronouns, and this poses a problem for Statistical Machine Translation (SMT) systems (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Novák, 2011; Guillou, 2012; Weiner, 2014; Hardmeier, 2014). Attention to date has focussed on coreferential (anaphoric) pronouns with NP antecedents, which, when translated from English into a language with grammatical gender, must agree with the translation of the head of the antecedent. Despite growing attention to this problem, little progress has been made, and little attention has been given to other pronouns. The central claim of this thesis is that pronouns performing different functions in text should be handled differently, both by SMT systems and when evaluating pronoun translation. This motivates the introduction of a new framework to categorise pronouns according to their function: anaphoric/cataphoric reference, event reference, extra-textual reference, pleonastic, addressee reference, speaker reference, generic reference, or other function. Labelling pronouns according to their function also helps to resolve instances of functional ambiguity, which arise when the same pronoun in the source language has multiple functions, each with different translation requirements in the target language. The categorisation framework is used in corpus annotation, corpus analysis, SMT system development and evaluation. I have directed the annotation and conducted analyses of a parallel corpus of English-German texts called ParCor (Guillou et al., 2014), in which pronouns are manually annotated according to their function. This provides a first step toward understanding the problems that SMT systems face when translating pronouns.
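The function categories listed above, and the notion of functional ambiguity, can be illustrated with a minimal Python sketch. This is not the annotation scheme or tooling used for ParCor: the names `PronounFunction`, `AnnotatedPronoun`, `functions_by_token` and `ambiguous_tokens` are hypothetical, and the example sentences are invented for illustration rather than taken from the corpus.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class PronounFunction(Enum):
    """The eight function categories named in the abstract (identifiers are hypothetical)."""
    ANAPHORIC_CATAPHORIC = "anaphoric/cataphoric reference"
    EVENT = "event reference"
    EXTRA_TEXTUAL = "extra-textual reference"
    PLEONASTIC = "pleonastic"
    ADDRESSEE = "addressee reference"
    SPEAKER = "speaker reference"
    GENERIC = "generic reference"
    OTHER = "other function"


@dataclass(frozen=True)
class AnnotatedPronoun:
    """One manually labelled pronoun occurrence (hypothetical record layout)."""
    token: str
    sentence: str
    function: PronounFunction


def functions_by_token(pronouns):
    """Map each lower-cased pronoun form to a Counter of its function labels."""
    table = {}
    for p in pronouns:
        table.setdefault(p.token.lower(), Counter())[p.function] += 1
    return table


def ambiguous_tokens(pronouns):
    """Return the pronoun forms observed with more than one function, sorted."""
    table = functions_by_token(pronouns)
    return sorted(tok for tok, counts in table.items() if len(counts) > 1)


# Invented examples (not from ParCor): the same form "it" in three functions.
EXAMPLES = [
    AnnotatedPronoun("It", "I bought a book. It was expensive.",
                     PronounFunction.ANAPHORIC_CATAPHORIC),
    AnnotatedPronoun("It", "He lost his job. It came as a surprise.",
                     PronounFunction.EVENT),
    AnnotatedPronoun("It", "It is raining.", PronounFunction.PLEONASTIC),
    AnnotatedPronoun("you", "Can you help me?", PronounFunction.ADDRESSEE),
    AnnotatedPronoun("I", "I bought a book.", PronounFunction.SPEAKER),
]

if __name__ == "__main__":
    for token in ambiguous_tokens(EXAMPLES):
        labels = sorted(f.value for f in functions_by_token(EXAMPLES)[token])
        print(token, "->", labels)
```

In this toy data only "it" is functionally ambiguous, which is the situation the function labels are meant to make explicit before translation or evaluation.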
In the thesis, I show how analysis of manual translation can prove useful in identifying and understanding systematic differences in pronoun use between two languages, and how it can help inform the design of SMT systems. In particular, the analysis revealed that the German translations in ParCor contain more anaphoric and pleonastic pronouns than their English originals, reflecting differences in pronoun use. This raises a particular problem for the evaluation of pronoun translation. Automatic evaluation methods that rely on reference translations to assess pronoun translation will not be able to provide an adequate evaluation when the

Liane Kirsten Guillou | Liane Guillou
