Entity Clustering Across Languages

Standard entity clustering systems commonly rely on mention (string) matching, syntactic features, and linguistic resources like English WordNet. When co-referent text mentions appear in different languages, these techniques cannot be easily applied. Consequently, we develop new methods for clustering text mentions across documents and languages simultaneously, producing cross-lingual entity clusters. Our approach extends standard clustering algorithms with cross-lingual mention and context similarity measures. Crucially, we do not assume a pre-existing entity list (knowledge base), so entity characteristics are unknown. On an Arabic-English corpus that contains seven different text genres, our best model yields a 24.3% F1 gain over the baseline.
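As a rough illustration of combining a mention similarity with a context similarity inside a clustering loop, here is a minimal Python sketch. It is not the method evaluated above. It assumes each mention already carries a romanized surface form and context words mapped into a shared vocabulary (both hypothetical preprocessing steps), and it substitutes simple stand-ins: a `difflib` string ratio for mention similarity, cosine similarity for context, and threshold-based single-link merging. The names `Mention`, `mention_sim`, `context_sim`, `cluster`, and the `weight` and `threshold` values are all illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher
from math import sqrt


@dataclass(frozen=True)
class Mention:
    doc_id: str
    lang: str
    romanized: str   # surface form in Latin script (assumed to be given)
    context: tuple   # context words in a shared vocabulary (assumed to be given)


def mention_sim(a: Mention, b: Mention) -> float:
    """String similarity in [0, 1] between the romanized surface forms."""
    return SequenceMatcher(None, a.romanized.lower(), b.romanized.lower()).ratio()


def context_sim(a: Mention, b: Mention) -> float:
    """Cosine similarity between bag-of-words context vectors."""
    ca, cb = Counter(a.context), Counter(b.context)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def cluster(mentions, weight=0.5, threshold=0.6):
    """Single-link clustering: merge any pair whose combined score passes the threshold.

    The number of clusters is not fixed in advance, so no entity list is needed.
    """
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            score = (weight * mention_sim(mentions[i], mentions[j])
                     + (1 - weight) * context_sim(mentions[i], mentions[j]))
            if score >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, m in enumerate(mentions):
        groups.setdefault(find(i), []).append(m)
    return list(groups.values())
```

The pairwise loop is quadratic in the number of mentions, which is acceptable for a sketch but would need blocking or sampling on a large corpus.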
