High-Value Token-Blocking: Efficient Blocking Method for Record Linkage

Data integration is an important component of Big Data analytics. One of its key challenges is record linkage, that is, matching records that represent the same real-world entity. Because comparing every pair of records is computationally expensive, blocking methods are employed in the record linkage pipeline to reduce the number of comparisons. In the past decade, a range of blocking techniques has been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient blocking approach that is unsupervised and schema-agnostic, based on a carefully crafted use of Term Frequency-Inverse Document Frequency (TF-IDF). We compare HVTB with multiple methods over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss the results in terms of accuracy, use of computational resources, and different characteristics of datasets and records. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. HVTB is shown to be significantly superior to other methods, suggesting that simpler methods for blocking should be considered before resorting to more sophisticated ones.
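
The abstract does not spell out the HVTB procedure, so the following is only a minimal sketch of the general idea of TF-IDF-driven, schema-agnostic token blocking, not the authors' implementation. It assumes that each record is flattened to a single string, that tokens are whitespace-separated, that tokens occurring in only one record are ignored because they cannot link anything, and that each record keeps its top_k highest-scoring tokens as blocking keys. The function names and the top_k parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    # Schema-agnostic assumption: treat the whole record as a bag of tokens.
    return text.lower().split()


def high_value_blocks(records, top_k=2):
    """Map each high-value token to the set of record ids that selected it.

    records: dict of record id -> record text (all attributes concatenated).
    top_k:   hypothetical parameter, number of blocking keys kept per record.
    """
    tokens = {rid: tokenize(text) for rid, text in records.items()}
    n = len(records)

    # Document frequency: number of records containing each token.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    blocks = defaultdict(set)
    for rid, toks in tokens.items():
        tf = Counter(toks)
        # TF-IDF score per token; tokens seen in a single record are skipped
        # because they can never place two records in the same block.
        scores = {
            t: (count / len(toks)) * math.log(n / df[t])
            for t, count in tf.items()
            if df[t] > 1
        }
        # Keep the top_k tokens; ties are broken alphabetically.
        for t in sorted(scores, key=lambda t: (-scores[t], t))[:top_k]:
            blocks[t].add(rid)

    # A block with a single record yields no comparisons, so drop it.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    # Only records that share at least one block are compared downstream.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    toy = {
        "a": "acme anvil the shop",
        "b": "acme anvil a store",
        "c": "zeta rocket the shop",
    }
    print(candidate_pairs(high_value_blocks(toy, top_k=2)))
```

In this toy setting, frequent tokens receive low scores and rarely become blocking keys, which is what keeps the number of candidate pairs below the all-pairs count. The value of top_k trades recall against the number of comparisons.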
