An unsupervised blocking technique for more efficient record linkage

Abstract Record linkage, referred to also as entity resolution, is the process of identifying pairs of records representing the same real-world entity (for example, a person) within a dataset or across multiple datasets. This allows for the integration of multi-source data which allows for better knowledge discovery. In order to reduce the number of record comparisons, record linkage frameworks initially perform a process commonly referred to as blocking, which involves separating records into blocks using a partition (or blocking) scheme. This restricts comparisons among records that belong to the same block during the linkage process. Existing blocking techniques often require some form of manual fine-tuning of parameter values for optimal performance. Optimal parameter values may be selected manually by a domain expert, or automatically learned using labelled data. However, in many real world situations no such labelled dataset may be available. In this paper we propose a novel unsupervised blocking technique for structured datasets that does not require labelled data or manual fine-tuning of parameters. Experimental evaluations, across a large number of datasets, demonstrate that this novel approach often achieves superior levels of proficiency to both supervised and unsupervised baseline techniques, often in less time.

[1]  Hector Garcia-Molina,et al.  Pay-As-You-Go Entity Resolution , 2013, IEEE Transactions on Knowledge and Data Engineering.

[2]  Peter Christen,et al.  A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication , 2012, IEEE Transactions on Knowledge and Data Engineering.

[3]  George Papastefanatos,et al.  Scaling Entity Resolution to Large, Heterogeneous Data with Enhanced Meta-blocking , 2016, EDBT.

[4]  Daniel P. Miranker,et al.  On Linking Heterogeneous Dataset Collections , 2014, SEMWEB.

[5]  Craig A. Knoblock,et al.  Learning Blocking Schemes for Record Linkage , 2006, AAAI.

[6]  Anna Jurek-Loughrey,et al.  A Review of Unsupervised and Semi-supervised Blocking Methods for Record Linkage , 2018, Unsupervised and Semi-Supervised Learning.

[7]  Peter Christen,et al.  Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface , 2008, KDD.

[8]  Claudia Niederée,et al.  A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces , 2013, IEEE Transactions on Knowledge and Data Engineering.

[9]  Felix Naumann,et al.  Progressive Duplicate Detection , 2015, IEEE Transactions on Knowledge and Data Engineering.

[10]  Marcos André Gonçalves,et al.  BLOSS: Effective meta-blocking with almost no effort , 2018, Inf. Syst..

[11]  Andreas Thor,et al.  Learning-Based Approaches for Matching Web Data Entities , 2010, IEEE Internet Computing.

[12]  Vassilios S. Verykios,et al.  Scalable Blocking for Privacy Preserving Record Linkage , 2015, KDD.

[13]  Peter Fankhauser,et al.  Efficient entity resolution for large heterogeneous information spaces , 2011, WSDM '11.

[14]  Erhard Rahm,et al.  Frameworks for entity matching: A comparison , 2010, Data Knowl. Eng..

[15]  David G. Stork,et al.  Pattern Classification , 1973 .

[16]  Jeff Heflin,et al.  Automatically Generating Data Linkages Using a Domain-Independent Candidate Selection Approach , 2011, SEMWEB.

[17]  George Papastefanatos,et al.  Supervised Meta-blocking , 2014, Proc. VLDB Endow..

[18]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[19]  Kaizhong Zhang,et al.  Evaluating a class of distance-mapping algorithms for data mining and clustering , 1999, KDD '99.

[20]  Felix Naumann,et al.  Schema matching using duplicates , 2005, 21st International Conference on Data Engineering (ICDE'05).

[21]  Peter Christen,et al.  Data Matching , 2012, Data-Centric Systems and Applications.

[22]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[23]  Vassilios S. Verykios,et al.  An LSH-Based Blocking Approach with a Homomorphic Matching Technique for Privacy-Preserving Record Linkage , 2015, IEEE Transactions on Knowledge and Data Engineering.

[24]  Ahmed K. Elmagarmid,et al.  TAILOR: a record linkage toolbox , 2002, Proceedings 18th International Conference on Data Engineering.

[25]  Panagiotis G. Ipeirotis,et al.  Duplicate Record Detection: A Survey , 2007 .

[26]  Raymond J. Mooney,et al.  Adaptive Blocking: Learning to Scale Up Record Linkage , 2006, Sixth International Conference on Data Mining (ICDM'06).

[27]  Anna Jurek,et al.  A new technique of selecting an optimal blocking method for better record linkage , 2018, Inf. Syst..

[28]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[29]  Chen Li,et al.  Efficient record linkage in large data sets , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[30]  Sonia Bergamaschi,et al.  BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution , 2016, Proc. VLDB Endow..

[31]  Avigdor Gal,et al.  Comparative Analysis of Approximate Blocking Techniques for Entity Resolution , 2016, Proc. VLDB Endow..

[32]  P Deepak,et al.  Semi-supervised and Unsupervised Approaches to Record Pairs Classification in Multi-Source Data Linkage , 2018, Unsupervised and Semi-Supervised Learning.

[33]  Wolfgang Nejdl,et al.  Meta-Blocking: Taking Entity Resolutionto the Next Level , 2014, IEEE Transactions on Knowledge and Data Engineering.

[34]  Christos Faloutsos,et al.  FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets , 1995, SIGMOD '95.

[35]  Anand Rajaraman,et al.  Mining of Massive Datasets , 2011 .

[36]  Raymond K. Wong,et al.  Unsupervised Blocking of Imbalanced Datasets for Record Matching , 2016, WISE.

[37]  Huizhi Liang,et al.  Semantic-Aware Blocking for Entity Resolution , 2016, IEEE Trans. Knowl. Data Eng..

[38]  R. Mooney,et al.  Learnable similarity functions and their application to record linkage and clustering , 2006 .

[39]  George Papastefanatos,et al.  Boosting the Efficiency of Large-Scale Entity Resolution with Enhanced Meta-Blocking , 2016, Big Data Res..

[40]  Daniel P. Miranker,et al.  Semi-supervised Instance Matching Using Boosted Classifiers , 2015, ESWC.

[41]  Daniel P. Miranker,et al.  A two-step blocking scheme learner for scalable link discovery , 2014, OM.

[42]  Andreas Thor,et al.  Evaluation of entity resolution approaches on real-world match problems , 2010, Proc. VLDB Endow..

[43]  George Papadakis,et al.  Blocking for large-scale Entity Resolution: Challenges, algorithms, and practical examples , 2016, 2016 IEEE 32nd International Conference on Data Engineering (ICDE).

[44]  Daniel P. Miranker,et al.  An Unsupervised Algorithm for Learning Blocking Schemes , 2013, 2013 IEEE 13th International Conference on Data Mining.

[45]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.