Solving the "Who's Mark Johnson Puzzle": Information Extraction Based Cross Document Coreference

Cross Document Coreference (CDC) is the problem of resolving the underlying identity of entities across multiple documents and is a major step for document understanding. We develop a framework to efficiently determine the identity of a person based on extracted information, which includes unary properties such as gender and title, as well as binary relationships with other named entities such as co-occurrence and geo-locations. At the heart of our approach is a suite of similarity functions (specialists) for matching relationships and a relational density-based clustering algorithm that delineates name clusters based on pairwise similarity. We demonstrate the effectiveness of our methods on the WePS benchmark datasets and point out future research directions.

[1]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[2]  David Yarowsky,et al.  Unsupervised Personal Name Disambiguation , 2003, CoNLL.

[3]  Pradeep Ravikumar,et al.  A Comparison of String Distance Metrics for Name-Matching Tasks , 2003, IIWeb.

[4]  James Allan,et al.  Cross-Document Coreference on a Large Scale Corpus , 2004, NAACL.

[5]  Julio Gonzalo,et al.  The SemEval-2007 WePS Evaluation: Establishing a benchmark for the Web People Search Task , 2007, Fourth International Workshop on Semantic Evaluations (SemEval-2007).

[6]  Claire Gardent,et al.  Improving Machine Learning Approaches to Coreference Resolution , 2002, ACL.

[7]  Sarah M. Taylor Information Extraction Tools: Deciphering Human Language , 2004, IT Prof..

[8]  Yoram Singer,et al.  Using and combining predictors that specialize , 1997, STOC '97.

[9]  Ying Chen,et al.  Towards Robust Unsupervised Personal Name Disambiguation , 2007, EMNLP-CoNLL.

[10]  C. Lee Giles,et al.  Efficient Name Disambiguation for Large-Scale Databases , 2006, PKDD.

[11]  Dan Roth,et al.  Robust Reading: Identification and Tracing of Ambiguous Names , 2004, NAACL.

[12]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[13]  S.M. Taylor,et al.  Deciphering human language [information extraction] , 2004, IT Professional.

[14]  Breck Baldwin,et al.  Entity-Based Cross-Document Coreferencing Using the Vector Space Model , 1998, COLING.