Cross-lingual Coreference Resolution: A New Task for Multilingual Comparable Corpora

We introduce cross-lingual coreference resolution, the task of grouping entity mentions with a common referent in a multilingual corpus. Information, especially on the web, is increasingly multilingual. We would like to track entity references across languages without machine translation, which is expensive and unavailable for many language pairs. Therefore, we develop a set of models that rely on decreasing levels of parallel resources: a bitext, a bilingual lexicon, and a parallel name list. We propose baselines, provide experimental results, and analyze sources of error. Across a range of metrics, we find that even our lowest-resource model gives a 2.5% absolute F1 improvement over the strongest baseline. Our results present a positive outlook for cross-lingual coreference resolution even in low-resource languages. We are releasing our cross-lingual annotations for the ACE2008 Arabic-English evaluation corpus.
