Constraint-Based Entity Matching

Entity matching is the problem of deciding whether two given mentions in the data, such as "Helen Hunt" and "H. M. Hunt", refer to the same real-world entity. Numerous solutions have been developed, but they have not considered in depth how to exploit the integrity constraints that frequently exist in application domains. Examples of such constraints include "a mention with age two cannot match a mention with salary 200K" and "if two paper citations match, then their authors are likely to match in the same order". In this paper we describe a probabilistic solution to entity matching that exploits such constraints to improve matching accuracy. At the heart of the solution is a generative model that takes the constraints into account during the generation process and provides well-defined interpretations of them. We describe a novel combination of EM and relaxation labeling algorithms that efficiently learns the model, thereby matching mentions in an unsupervised way, without the need for annotated training data. Experiments on several real-world domains show that our solution can exploit constraints to significantly improve matching accuracy, by 3-12% F1, and that it scales up to large data sets.
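To make the first example constraint concrete, the following is a minimal sketch of how a hard constraint could veto a candidate match. It is not the generative model or the EM and relaxation labeling procedure described above. The field names `age` and `salary`, the thresholds, the helper names, and the use of a plain string-similarity score are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Crude string similarity in [0, 1]; a stand-in for a real matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def violates_age_salary(m1, m2):
    """Hypothetical hard constraint: a mention with age <= 2 cannot match
    a mention with salary >= 200K. Missing fields never trigger it."""
    for child, earner in ((m1, m2), (m2, m1)):
        age = child.get("age")
        salary = earner.get("salary")
        if age is not None and salary is not None:
            if age <= 2 and salary >= 200_000:
                return True
    return False


def match_score(m1, m2, constraints=(violates_age_salary,)):
    """Return 0.0 if any hard constraint is violated, else the name similarity."""
    if any(constraint(m1, m2) for constraint in constraints):
        return 0.0
    return name_similarity(m1["name"], m2["name"])


# Hypothetical mentions, built around the names used in the text.
toddler = {"name": "Helen Hunt", "age": 2}
executive = {"name": "H. M. Hunt", "salary": 200_000}
unknown = {"name": "Helen Hunt"}

vetoed = match_score(toddler, executive)    # constraint fires
allowed = match_score(unknown, executive)   # no age known, so no veto
```

In this sketch a violated constraint forces the score to zero. A soft constraint, such as the author-order example, would instead adjust the score up or down, which is closer in spirit to the probabilistic treatment the abstract describes.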
