Identifying Co-referential Names Across Large Corpora

A single logical entity can be referred to by several different names across a large text corpus. We present an algorithm for finding all such co-reference sets in a large corpus. The algorithm involves three steps: morphological similarity detection, contextual similarity analysis, and clustering. Finally, we present experimental results on a large corpus of real news text to analyze the performance of our techniques.
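
The three steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measures or the clustering method, so this example assumes a character-level string ratio for morphological similarity, cosine similarity over bag-of-words contexts for contextual similarity, and single-link merging via union-find for clustering. All function names, thresholds, and sample data are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations
from math import sqrt


def morphological_similarity(a, b):
    """String-level similarity in [0, 1] (assumed stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def contextual_similarity(ctx_a, ctx_b):
    """Cosine similarity between two bag-of-words context Counters."""
    dot = sum(ctx_a[w] * ctx_b[w] for w in ctx_a.keys() & ctx_b.keys())
    norm_a = sqrt(sum(v * v for v in ctx_a.values()))
    norm_b = sqrt(sum(v * v for v in ctx_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_names(contexts, morph_min=0.6, ctx_min=0.5):
    """Group names whose spelling AND context both look alike.

    contexts maps each name to a Counter of words seen near it.
    Thresholds are illustrative assumptions, not tuned values.
    """
    parent = {name: name for name in contexts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(contexts, 2):
        # Step 1: cheap morphological filter on the name strings.
        if morphological_similarity(a, b) < morph_min:
            continue
        # Step 2: confirm the candidate pair using surrounding context.
        if contextual_similarity(contexts[a], contexts[b]) >= ctx_min:
            # Step 3: single-link clustering by merging the two sets.
            parent[find(a)] = find(b)

    groups = {}
    for name in contexts:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Toy data (hypothetical): words observed near each name.
example_contexts = {
    "George Bush": Counter("president white house texas".split()),
    "George W. Bush": Counter("president white house iraq".split()),
    "George Burns": Counter("comedian actor film".split()),
}
clusters = cluster_names(example_contexts)
```

In this toy data, "George Bush" and "George Burns" are close as strings but share no context words, so only the two Bush variants are merged. This is the kind of case that motivates combining morphological and contextual evidence before clustering.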
