Entity Clustering Across Languages

Standard entity clustering systems commonly rely on mention (string) matching, syntactic features, and linguistic resources like English WordNet. When co-referent text mentions appear in different languages, these techniques cannot be easily applied. Consequently, we develop new methods for clustering text mentions across documents and languages simultaneously, producing cross-lingual entity clusters. Our approach extends standard clustering algorithms with cross-lingual mention and context similarity measures. Crucially, we do not assume a pre-existing entity list (knowledge base), so entity characteristics are unknown. On an Arabic-English corpus that contains seven different text genres, our best model yields a 24.3% F1 gain over the baseline.

Mark Dredze | Christopher D. Manning | Spence Green | Nicholas Andrews | Matthew R. Gormley