Exploiting block co-occurrence to control block sizes for entity resolution

The problem of identifying duplicate entities in a dataset has gained increasing importance in recent decades. Because of its intrinsic quadratic complexity, this problem can be very costly to solve on large datasets. Both researchers and practitioners have developed a variety of techniques to speed up its solution. One of these techniques is blocking, an indexing technique that splits the dataset into a set of blocks such that each block contains entities that share a common property, as evaluated by a blocking key function. To improve the efficacy of blocking, multiple blocking keys may be used, which generates a set of blocking results. In this paper, we investigate how to control the size of the blocks generated by multiple blocking keys while maintaining reasonable result quality, measured by the quality of the produced blocks. By controlling block sizes, we can reduce the overall cost of solving an entity resolution problem and facilitate a variety of tasks (e.g., real-time and privacy-preserving entity resolution). To do so, we propose several heuristics that exploit the co-occurrence of entities among the generated blocks for pruning, splitting, and merging blocks. Experiments on four datasets confirm the adequacy of the proposed heuristics for generating block sizes within a predefined range threshold while maintaining reasonable blocking quality.
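
To make the terms above concrete, the following is a minimal Python sketch of blocking with multiple blocking keys and of counting how often pairs of entities co-occur across the resulting blocks. It is an illustration under stated assumptions, not the heuristics proposed in the paper: the function names (build_blocks, pair_cooccurrence), the toy records, and the two key functions (a surname prefix and a postcode) are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key_functions):
    """Group record ids by the value of each blocking key function.

    Every (key name, key value) combination identifies one block, so a
    record lands in one block per blocking key.
    """
    blocks = defaultdict(set)
    for rid, record in records.items():
        for name, fn in key_functions.items():
            blocks[(name, fn(record))].add(rid)
    return dict(blocks)


def pair_cooccurrence(blocks):
    """Count, for each pair of record ids, how many blocks contain both."""
    counts = defaultdict(int)
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy data; the attribute names are assumptions.
records = {
    1: {"name": "john smith", "zip": "2600"},
    2: {"name": "jon smith", "zip": "2600"},
    3: {"name": "jane smyth", "zip": "2601"},
    4: {"name": "mary jones", "zip": "2600"},
}

# Two illustrative blocking key functions (assumed, not from the paper).
key_functions = {
    "surname_prefix": lambda r: r["name"].split()[-1][:2],
    "zip": lambda r: r["zip"],
}

blocks = build_blocks(records, key_functions)
cooc = pair_cooccurrence(blocks)

for block_id, members in sorted(blocks.items()):
    print(block_id, sorted(members))
for pair, count in sorted(cooc.items()):
    print(pair, count)
```

In this toy run, records 1 and 2 share both a surname-prefix block and a postcode block, so their co-occurrence count is higher than that of any other pair. A count of this kind is one plausible signal for deciding which entities to keep together when an oversized block is pruned or split, or which small blocks to merge; how the paper's heuristics actually use co-occurrence is not specified in the abstract.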
