A new technique of selecting an optimal blocking method for better record linkage

Abstract

Record linkage, also referred to as entity resolution, is the process of identifying pairs of records that represent the same real-world entity (e.g. a person) within a dataset or across multiple datasets. To reduce the number of record comparisons, record linkage frameworks first perform a process referred to as blocking, which splits records into a set of blocks using a partition (or blocking) scheme. The linkage process then compares only records that belong to the same block. Existing blocking methods are often evaluated using different metrics and independently of the choice of the subsequent linkage method, which makes the choice of an optimal approach highly subjective. In this paper, we demonstrate that existing evaluation metrics fail to provide strong evidence to support the selection of an optimal blocking method. We conduct an extensive evaluation of different blocking methods using multiple datasets and several commonly applied linkage techniques to show that the evaluation of a blocking method must take the subsequent linkage phase into consideration. We propose a novel evaluation technique that takes multiple factors into consideration, including the end-to-end running time of the combined blocking and linkage phases and the linkage technique used. We empirically demonstrate, using multiple datasets, that according to this evaluation technique some blocking methods can fairly be considered superior to others, while others should be deemed incomparable. Finally, we propose a novel blocking method selection procedure that takes into consideration the linkage proficiency and end-to-end time of different blocking methods combined with a given linkage technique. We show that this procedure is able to select the best or near-best blocking method for unseen data.
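
To make the two central ideas of the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a simple key-based blocking scheme (first letter of the surname) and a two-criterion comparison in which one blocking method is "superior" only if it is at least as good on both linkage quality and end-to-end time and strictly better on at least one. The function names, the toy records, and the dominance rule itself are all illustrative assumptions; the paper's actual evaluation technique may differ.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_func):
    """Group record ids by the blocking key computed from each record."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_func(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Return the set of record-id pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def compare(a, b):
    """Compare two (linkage_quality, end_to_end_seconds) results.

    Hypothetical rule: higher quality and lower time are better; a method
    is superior only if it is no worse on both criteria.
    """
    a_no_worse = a[0] >= b[0] and a[1] <= b[1]
    b_no_worse = b[0] >= a[0] and b[1] <= a[1]
    if a_no_worse and b_no_worse:
        return "equal"
    if a_no_worse:
        return "a superior"
    if b_no_worse:
        return "b superior"
    return "incomparable"


# Toy data, invented for illustration only.
records = {
    1: {"name": "John Smith", "city": "Leeds"},
    2: {"name": "Jon Smith", "city": "Leeds"},
    3: {"name": "Mary Jones", "city": "York"},
    4: {"name": "Marie Jones", "city": "York"},
    5: {"name": "Alan Brown", "city": "Leeds"},
}

# Blocking key (an assumption): first letter of the last name token.
blocks = block_records(records, lambda r: r["name"].split()[-1][0].upper())
pairs = candidate_pairs(blocks)
all_pairs = len(records) * (len(records) - 1) // 2

print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
print(compare((0.90, 10.0), (0.80, 12.0)))  # better on both criteria
print(compare((0.90, 20.0), (0.80, 12.0)))  # better quality, slower
```

In this toy run, blocking reduces the exhaustive comparison of all record pairs to only the pairs within each block, and the second comparison illustrates why a higher-quality but slower method cannot be ranked against a faster, lower-quality one without further criteria.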
