A Latent Dirichlet Model for Unsupervised Entity Resolution

Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution in relational domains, where references are connected to each other. Our approach differs from other recently proposed entity resolution approaches in that it a) is generative, b) does not make pair-wise decisions, and c) captures relations between entities through a hidden group variable. We propose a novel sampling algorithm for collective entity resolution that is unsupervised and takes entity relations into account. Additionally, we do not assume the domain of entities to be known and show how to infer the number of entities from the data. We demonstrate the utility and practicality of our relational entity resolution approach for author resolution in two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which relational information is useful.
