ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT

We present ParCor, a parallel corpus of texts in which pronoun coreference – reduced coreference in which pronouns are used as referring expressions – has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and ultimately for developing and testing informed Statistical Machine Translation systems aimed at addressing the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. We provide details of the texts that we selected, the guidelines and tools used to support annotation and some corpus statistics. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future.
