Profile Based Cross-Document Coreference Using Kernelized Fuzzy Relational Clustering

Coreferencing entities across documents in a large corpus enables advanced document understanding tasks such as question answering. This paper presents a novel cross-document coreference approach that leverages entity profiles, which are constructed using information extraction tools and reconciled using a within-document coreference module. We propose to match the profiles using a learned ensemble distance function composed of a suite of similarity specialists. We develop a kernelized soft relational clustering algorithm that uses the learned distance function to partition the entities into fuzzy sets of identities. We compare the kernelized clustering method with a popular fuzzy relational clustering algorithm (FRC) and show a 5% improvement in coreference performance. Evaluation of our proposed methods on a large benchmark disambiguation collection shows that they compare favorably with the top runs in the SemEval evaluation.
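The pipeline outlined in the abstract (specialist similarities combined into an ensemble distance, a kernel applied to that distance, and soft cluster memberships) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the paper's method: the two toy specialists, the fixed weights (the paper learns its ensemble), the Gaussian kernel and its width `sigma`, and the use of the standard fuzzy c-means membership formula. Applying a Gaussian function to an arbitrary learned distance is also not guaranteed to yield a valid Mercer kernel.

```python
import math

def name_specialist(a, b):
    """Hypothetical specialist: Jaccard overlap of name tokens, or None to abstain."""
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else None

def org_specialist(a, b):
    """Hypothetical specialist: abstains when either profile lacks an organization."""
    if not a.get("org") or not b.get("org"):
        return None
    return 1.0 if a["org"] == b["org"] else 0.0

def ensemble_distance(a, b, specialists, weights):
    """Weighted average of (1 - similarity) over the specialists that did not abstain."""
    total, norm = 0.0, 0.0
    for spec, w in zip(specialists, weights):
        sim = spec(a, b)
        if sim is None:          # specialist abstains on this pair
            continue
        total += w * (1.0 - sim)
        norm += w
    return total / norm if norm > 0 else 1.0  # maximal distance if all abstain

def kernel_distance_sq(d, sigma=0.5):
    """Squared feature-space distance 2 * (1 - K) for an assumed Gaussian kernel K."""
    k = math.exp(-(d * d) / (2.0 * sigma * sigma))
    return 2.0 * (1.0 - k)

def fuzzy_memberships(sq_dists, m=2.0, eps=1e-12):
    """Standard fuzzy c-means membership update from squared distances to prototypes."""
    zeros = [i for i, d in enumerate(sq_dists) if d < eps]
    if zeros:  # entity coincides with one or more prototypes
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(sq_dists))]
    inv = [d ** (-1.0 / (m - 1.0)) for d in sq_dists]
    total = sum(inv)
    return [v / total for v in inv]

# Toy profiles (invented for this example).
entity = {"name": "John Smith", "org": "IBM"}
prototypes = [
    {"name": "John A. Smith", "org": "IBM"},
    {"name": "Jane Smith", "org": None},
]
specialists = [name_specialist, org_specialist]
weights = [0.6, 0.4]  # assumed fixed weights; a learned ensemble would set these

distances = [ensemble_distance(entity, p, specialists, weights) for p in prototypes]
memberships = fuzzy_memberships([kernel_distance_sq(d) for d in distances])
```

In this sketch the entity receives a graded membership in each identity instead of a hard assignment, which is the sense in which the partition is fuzzy. Representing each identity by a single prototype profile is a simplification made here for brevity; a relational algorithm works from the pairwise distance matrix instead.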
