Collective entity resolution in relational data

Many databases contain uncertain and imprecise references to real-world entities. Because the underlying entities lack identifiers, a database often contains multiple references to the same entity. This can lead not only to data redundancy but also to inaccuracies in query processing and knowledge extraction. These problems can be alleviated through entity resolution, which involves discovering the underlying entities and mapping each database reference to them. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may co-occur. In these cases, collective entity resolution, in which entities for co-occurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information to determine the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
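To make the idea of resolving co-occurring references jointly more concrete, the following is a minimal sketch of a greedy agglomerative clustering that mixes attribute and relational similarity. It is an illustration under stated assumptions, not the algorithm or the efficient implementation described above: the toy name matcher, the Jaccard measure over neighboring clusters, the linear weight `alpha`, the merge `threshold`, and all function names are hypothetical choices made for this example.

```python
from itertools import combinations


def attribute_similarity(a, b):
    """Toy name matcher (assumption): 1.0 for the same full name,
    0.5 when an abbreviated first name matches by initial, else 0.0."""
    first_a, last_a = a.lower().split()
    first_b, last_b = b.lower().split()
    if last_a != last_b:
        return 0.0
    if first_a.endswith(".") or first_b.endswith("."):
        return 0.5 if first_a[0] == first_b[0] else 0.0
    return 1.0 if first_a == first_b else 0.0


def jaccard(s, t):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def collective_resolve(names, cooccurrences, alpha=0.5, threshold=0.4):
    """Greedily merge the most similar pair of clusters until no pair
    reaches the threshold. alpha and threshold are illustrative values."""
    cluster_of = {ref: ref for ref in names}   # reference -> cluster label
    members = {ref: {ref} for ref in names}    # cluster label -> references

    def neighbors(c):
        # Labels of clusters whose references co-occur with a member of c.
        found = set()
        for group in cooccurrences:
            if group & members[c]:
                found |= {cluster_of[ref] for ref in group}
        found.discard(c)
        return found

    def similarity(c1, c2):
        # Complete-link attribute similarity: every cross pair must agree.
        attr = min(attribute_similarity(names[a], names[b])
                   for a in members[c1] for b in members[c2])
        # Relational similarity depends on the current cluster assignments,
        # so earlier merges can change the score of later candidate pairs.
        rel = jaccard(neighbors(c1), neighbors(c2))
        return (1 - alpha) * attr + alpha * rel

    while True:
        best = max(((similarity(c1, c2), c1, c2)
                    for c1, c2 in combinations(sorted(members), 2)),
                   default=None)
        if best is None or best[0] < threshold:
            break
        _, keep, absorb = best
        members[keep] |= members.pop(absorb)
        for ref in members[keep]:
            cluster_of[ref] = keep
    return sorted(sorted(m) for m in members.values())


# Hypothetical author references; each set is one co-occurrence (e.g. a paper).
names = {
    "r1": "J. Smith", "r2": "Anil Kumar",
    "r3": "John Smith", "r4": "Anil Kumar",
    "r5": "Jane Smith", "r6": "B. Lee",
}
cooccurrences = [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}]

result = collective_resolve(names, cooccurrences)
print(result)
```

In this toy data, "J. Smith" matches "John Smith" and "Jane Smith" equally well on attributes alone. Once the two "Anil Kumar" references are merged, "J. Smith" and "John Smith" share a neighboring cluster, which raises their combined score above the threshold, while "Jane Smith" stays separate. The sketch recomputes every pairwise score on each pass, so it is meant for reading rather than for use on large databases.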
