A framework for entity resolution with efficient blocking

In Web data integration applications, we frequently need to determine whether data objects in different data sources represent the same real-world entity. This problem is known as entity resolution. In this paper, we propose a generic framework for entity resolution over relational data sets, called BARM, consisting of a Blocker, Attribute matchers, and a Record Matcher. BARM is designed so that different blocking and matching algorithms can be plugged into it. For the blocker, we apply SPectrAl Neighborhood (SPAN), a state-of-the-art blocking algorithm, to our data sets and show that SPAN is effective and efficient. For the attribute matchers, we propose the Context Sensitive Value Matching Library (CSVML) for matching attribute values, along with an approach for evaluating the quality of matching functions. CSVML takes the meaning and context of attribute values into account, and our experimental results show that it performs well. We adopt a Bayesian network as the record matcher in the framework and propose an inference method based on the Markov blanket of the network. For comparison, we also apply three other classifiers to our data sets: Decision Tree, Support Vector Machines, and the Naive Bayes classifier. Experiments show that the Bayesian network is advantageous in the book domain.
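The three-stage structure described above (blocker, attribute matchers, record matcher) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the key-based blocker is a simple stand-in and not SPAN, the similarity functions are generic and not part of CSVML, the averaged-score threshold rule takes the place of the Bayesian network, and the sample records are invented. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
AttributeMatcher = Callable[[str, str], float]  # returns a similarity in [0, 1]


def token_jaccard(a: str, b: str) -> float:
    """Generic attribute matcher: Jaccard similarity over lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def exact(a: str, b: str) -> float:
    """Generic attribute matcher: 1.0 on case-insensitive equality, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def block_by_key(records: List[Record],
                 key: Callable[[Record], str]) -> List[List[int]]:
    """Stand-in blocker (NOT SPAN): group record indices sharing a key."""
    blocks: Dict[str, List[int]] = defaultdict(list)
    for i, record in enumerate(records):
        blocks[key(record)].append(i)
    return list(blocks.values())


def match_records(a: Record, b: Record,
                  matchers: Dict[str, AttributeMatcher],
                  threshold: float) -> bool:
    """Stand-in record matcher (NOT a Bayesian network): threshold the
    mean of the per-attribute similarity scores."""
    scores = [m(a[attr], b[attr]) for attr, m in matchers.items()]
    return sum(scores) / len(scores) >= threshold


def resolve(records: List[Record],
            key: Callable[[Record], str],
            matchers: Dict[str, AttributeMatcher],
            threshold: float = 0.6) -> List[Tuple[int, int]]:
    """Compare only pairs that fall in the same block."""
    matches = []
    for block in block_by_key(records, key):
        for i, j in combinations(block, 2):
            if match_records(records[i], records[j], matchers, threshold):
                matches.append((i, j))
    return matches


# Invented sample records, for illustration only.
books = [
    {"title": "Data Mining Practical Tools", "year": "2005"},
    {"title": "Data Mining: Practical Tools", "year": "2005"},
    {"title": "Probabilistic Reasoning", "year": "1988"},
]
pairs = resolve(
    books,
    key=lambda r: r["title"].split()[0].lower(),  # block on first title token
    matchers={"title": token_jaccard, "year": exact},
)
print(pairs)
```

Because each stage is passed in as a function, a different blocker or matcher can be substituted without changing `resolve`, which mirrors the pluggable design the framework aims for.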
