Enhancing Cross Document Coreference of Web Documents with Context Similarity and Very Large Scale Text Categorization

Cross Document Coreference (CDC) is the task of constructing the coreference chain for mentions of a person across a set of documents. This work offers a holistic view of using document-level categories, sub-document-level context, and extracted entities and relations for the CDC task. We train a categorization component with an efficient flat algorithm using thousands of ODP categories and over a million web documents. We propose to use ranked categories as coreference information, which is particularly suitable for web documents that differ widely in style and content. An ensemble composite coreference function, amenable to inactive features, combines these three levels of evidence for disambiguation. A thorough feature importance study is conducted to analyze how these three components contribute to the coreference results. The overall solution is evaluated on the WePS benchmark data and demonstrates superior performance.
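
To make the idea of combining three levels of evidence concrete, the following is a minimal sketch of a pairwise coreference score between two documents that mention the same name. It is an illustration under stated assumptions, not the function used in this work: the abstract does not specify the similarity measures or the ensemble, so the reciprocal-rank category weighting, the Jaccard overlap for entities, the weighted average, and all names (`category_similarity`, `entity_overlap`, `composite_score`, the weight values) are hypothetical. A feature is treated as inactive when its evidence is missing for either document, and it is then simply left out of the combination.

```python
import math
from typing import Dict, List, Optional, Set


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def category_similarity(a: List[str], b: List[str]) -> Optional[float]:
    """Similarity of two ranked category lists (assumed measure).

    Each category is weighted by its reciprocal rank, so agreement near
    the top of the rankings counts more. Returns None (inactive) when
    either document has no predicted categories.
    """
    if not a or not b:
        return None
    weights_a = {cat: 1.0 / (rank + 1) for rank, cat in enumerate(a)}
    weights_b = {cat: 1.0 / (rank + 1) for rank, cat in enumerate(b)}
    return cosine(weights_a, weights_b)


def entity_overlap(a: Set[str], b: Set[str]) -> Optional[float]:
    """Jaccard overlap of extracted entities (assumed measure).

    Returns None (inactive) when either document yielded no entities.
    """
    if not a or not b:
        return None
    return len(a & b) / len(a | b)


def composite_score(
    scores: Dict[str, Optional[float]], weights: Dict[str, float]
) -> float:
    """Weighted average over active features only (hypothetical stand-in
    for the ensemble). Inactive features (None) are skipped and the
    remaining weights are renormalized."""
    active = {name: s for name, s in scores.items() if s is not None}
    if not active:
        return 0.0
    total_weight = sum(weights[name] for name in active)
    return sum(weights[name] * s for name, s in active.items()) / total_weight


# Usage: the context feature is inactive for this document pair.
scores = {
    "category": category_similarity(
        ["Science/Physics", "Reference/Education"],
        ["Science/Physics", "Reference/Education"],
    ),
    "context": None,
    "entities": entity_overlap({"MIT", "Boston"}, {"MIT", "Chicago"}),
}
weights = {"category": 1.0, "context": 1.0, "entities": 2.0}  # assumed values
pair_score = composite_score(scores, weights)
```

Skipping inactive features and renormalizing keeps a pair from being penalized merely because one kind of evidence could not be extracted, which is one plausible reading of "amenable to inactive features"; a learned ensemble would replace the fixed weights used here.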
