Paper information - Solving the "Who's Mark Johnson Puzzle": Information Extraction Based Cross Document Coreference

Cross Document Coreference (CDC) is the problem of resolving the underlying identity of entities across multiple documents and is a major step for document understanding. We develop a framework to efficiently determine the identity of a person based on extracted information, which includes unary properties such as gender and title, as well as binary relationships with other named entities such as co-occurrence and geo-locations. At the heart of our approach is a suite of similarity functions (specialists) for matching relationships and a relational density-based clustering algorithm that delineates name clusters based on pairwise similarity. We demonstrate the effectiveness of our methods on the WePS benchmark datasets and point out future research directions.

C. Lee Giles | Jian Huang | Sarah M. Taylor | Jonathan L. Smith | Konstantinos A. Fotiadis

[1] David W. Conrath, et al. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy, 1997, ROCLING/IJCLCLP.

[2] David Yarowsky, et al. Unsupervised Personal Name Disambiguation, 2003, CoNLL.

[3] Pradeep Ravikumar, et al. A Comparison of String Distance Metrics for Name-Matching Tasks, 2003, IIWeb.

[4] James Allan, et al. Cross-Document Coreference on a Large Scale Corpus, 2004, NAACL.

[5] Julio Gonzalo, et al. The SemEval-2007 WePS Evaluation: Establishing a benchmark for the Web People Search Task, 2007, Fourth International Workshop on Semantic Evaluations (SemEval-2007).

[6] Claire Gardent, et al. Improving Machine Learning Approaches to Coreference Resolution, 2002, ACL.

[7] Sarah M. Taylor. Information Extraction Tools: Deciphering Human Language, 2004, IT Prof.

[8] Yoram Singer, et al. Using and combining predictors that specialize, 1997, STOC '97.

[9] Ying Chen, et al. Towards Robust Unsupervised Personal Name Disambiguation, 2007, EMNLP-CoNLL.

[10] C. Lee Giles, et al. Efficient Name Disambiguation for Large-Scale Databases, 2006, PKDD.

[11] Dan Roth, et al. Robust Reading: Identification and Tracing of Ambiguous Names, 2004, NAACL.

[12] Hans-Peter Kriegel, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996, KDD.

[13] S.M. Taylor, et al. Deciphering human language [information extraction], 2004, IT Professional.

[14] Breck Baldwin, et al. Entity-Based Cross-Document Coreferencing Using the Vector Space Model, 1998, COLING.