Entity Resolution with Markov Logic

Entity resolution is the problem of determining which records in a database refer to the same entities, and is a crucial and expensive step in the data mining process. Interest in it has grown rapidly, and many approaches have been proposed. However, they tend to address only isolated aspects of the problem, and are often ad hoc. This paper proposes a well-founded, integrated solution to the entity resolution problem based on Markov logic. Markov logic combines first-order logic and probabilistic graphical models by attaching weights to first-order formulas, and viewing them as templates for features of Markov networks. We show how a number of previous approaches can be formulated and seamlessly combined in Markov logic, and how the resulting learning and inference problems can be solved efficiently. Experiments on two citation databases show the utility of this approach, and evaluate the contribution of the different components.
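
As a rough illustration of how weighted first-order formulas act as templates for Markov network features, the sketch below enumerates every possible world of a toy three-record domain and scores each one by exp(sum of weight times number of true groundings), the standard Markov logic form. Everything in it is an assumption made for illustration rather than the paper's actual model: the records `r1` to `r3`, the `SIMILAR` evidence, the two formulas (title similarity implies a match; matches are transitive), the weights, and the simplification of counting one transitivity grounding per unordered conclusion pair. Brute-force enumeration is only feasible at this size and is not the efficient inference the abstract refers to.

```python
import itertools
import math

# Hypothetical toy domain: three citation records and their unordered pairs.
RECORDS = ["r1", "r2", "r3"]
PAIRS = list(itertools.combinations(RECORDS, 2))

# Assumed evidence: which pairs have similar titles.
SIMILAR = {("r1", "r2"): True, ("r1", "r3"): False, ("r2", "r3"): True}


def implies(a, b):
    """Truth value of the implication a => b."""
    return (not a) or b


def count_true_groundings(match):
    """Count satisfied groundings of each formula in one possible world.

    match maps each record pair to True/False (the query predicate Match).
    """
    # Formula 1: SimilarTitle(x, y) => Match(x, y), one grounding per pair.
    n_sim = sum(implies(SIMILAR[p], match[p]) for p in PAIRS)
    # Formula 2: Match(x, y) ^ Match(y, z) => Match(x, z).
    # Simplification: one grounding per choice of conclusion pair.
    n_trans = 0
    for conclusion in PAIRS:
        first, second = [p for p in PAIRS if p != conclusion]
        n_trans += implies(match[first] and match[second], match[conclusion])
    return n_sim, n_trans


def marginals(w_sim, w_trans):
    """Return P(Match(pair) = True) for every pair by exhaustive enumeration."""
    z = 0.0  # partition function
    mass = {p: 0.0 for p in PAIRS}
    for values in itertools.product([False, True], repeat=len(PAIRS)):
        match = dict(zip(PAIRS, values))
        n_sim, n_trans = count_true_groundings(match)
        # Unnormalized world probability: exp(sum_i w_i * n_i(world)).
        score = math.exp(w_sim * n_sim + w_trans * n_trans)
        z += score
        for p in PAIRS:
            if match[p]:
                mass[p] += score
    return {p: mass[p] / z for p in PAIRS}
```

Calling `marginals(1.5, 0.0)` leaves the pair `("r1", "r3")`, which has no title evidence, at probability 0.5, while `marginals(1.5, 2.0)` raises it above 0.5 because the transitivity formula propagates the two other likely matches. This is the sense in which separate sources of evidence can be combined in a single model.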
