This paper describes a language-independent, scalable system that addresses both challenges of cross-document co-reference: name variation and entity disambiguation. We report system results from the ACE 2008 evaluation in both English and Arabic. Our English system's accuracy is 8.4% higher, in relative terms, than an exact-match baseline (and 14.2% higher, in relative terms, for entities mentioned in more than one document). Unlike previous evaluations, ACE 2008 evaluated both name variation and entity disambiguation over naturally occurring named mentions. An information extraction engine finds document entities in text. We describe how our architecture, designed for the 10K-document ACE task, scales to an even larger corpus. Our cross-document approach uses the names of entities to find an initial set of document entities that could refer to the same real-world entity, and then uses an agglomerative clustering algorithm to disambiguate the potentially co-referent document entities. We analyze how different aspects of our system affect performance using ablation studies over the English evaluation set. In addition to evaluating cross-document co-reference performance, we used the results of the cross-document system to improve the accuracy of within-document extraction, and measured the impact in the ACE 2008 within-document evaluation.
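The two-stage approach described above (name-based candidate selection followed by agglomerative clustering) can be illustrated with a minimal Python sketch. The abstract does not specify the name-matching rule, the features, the linkage criterion, or the stopping threshold, so everything concrete here is an assumption made for illustration: the hypothetical `name_key` blocking function (lowercased last name token), bag-of-words context vectors, cosine similarity over summed cluster vectors, and the `threshold` value of 0.3.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def name_key(name):
    # Hypothetical blocking key (an assumption, not the paper's rule):
    # the lowercased last token of the name.
    return name.lower().split()[-1]


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_block(entities, threshold):
    # Agglomerative clustering within one candidate set: repeatedly merge
    # the most similar pair of clusters until no pair reaches the threshold.
    clusters = [([eid], Counter(ctx)) for eid, ctx in entities]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        if cosine(clusters[i][1], clusters[j][1]) < threshold:
            break
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(ids) for ids, _ in clusters]


def cross_doc_coref(doc_entities, threshold=0.3):
    # doc_entities: iterable of (entity_id, name, context_text) tuples.
    # Stage 1: group document entities whose names share a blocking key.
    blocks = defaultdict(list)
    for eid, name, context in doc_entities:
        blocks[name_key(name)].append((eid, Counter(context.lower().split())))
    # Stage 2: disambiguate each candidate set by clustering on context.
    result = []
    for block in blocks.values():
        result.extend(cluster_block(block, threshold))
    return sorted(result)


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    toy = [
        ("d1:e1", "John Smith", "senator voted budget bill"),
        ("d2:e1", "J. Smith", "senator budget vote congress"),
        ("d3:e1", "John Smith", "striker scored goal match"),
        ("d4:e1", "Mary Jones", "physicist published paper"),
    ]
    for cluster in cross_doc_coref(toy):
        print(cluster)
```

Restricting clustering to entities that share a name key keeps the pairwise comparisons local to each candidate set, which is one plausible way such a design could scale beyond a 10K-document corpus. A last-token key would miss name variants such as nicknames or transliterations, so a real system would need a more permissive matching step.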