A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data

In data integration, entity resolution (ER) is an important technique for improving data quality. Existing research typically assumes that the target dataset contains only string-type data and uses a single similarity metric. For larger, high-dimensional datasets, redundant information needs to be verified using traditional blocking or windowing techniques. In this work, we propose a novel ER method based on a hybrid approach that combines type-based multiblocks, a varying window size, and more flexible similarity metrics. In our new ER workflow, we reduce the search space for entity pairs by constraining on redundant attributes and matching likelihood. We develop a reference implementation of the proposed approach and validate its performance using a real-life dataset from an Internet of Things project. We evaluate the data processing system using five standard metrics: effectiveness, efficiency, accuracy, recall, and precision. Experimental results indicate that the proposed approach could be a promising alternative for entity resolution and could feasibly be applied to real-world data cleaning for large datasets.
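To make the general idea concrete, the following is a minimal sketch of windowed candidate generation combined with a type-dependent similarity metric. It is not the reference implementation described above: the function names, the choice of `difflib.SequenceMatcher` for strings, the relative-difference rule for numbers, the fixed `window`, and the `threshold` value are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Pick a similarity metric by value type (illustrative rule, assumed)."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        # Numeric values: 1 minus the relative difference, clamped to [0, 1].
        scale = max(abs(a), abs(b), 1.0)
        return 1.0 - min(abs(a - b) / scale, 1.0)
    # Everything else is compared as a string.
    return SequenceMatcher(None, str(a), str(b)).ratio()


def record_similarity(r1, r2, attrs):
    """Average the per-attribute similarities of two records."""
    return sum(similarity(r1[k], r2[k]) for k in attrs) / len(attrs)


def candidate_pairs(records, key, window):
    """Sort by one attribute and pair each record with its window neighbours."""
    order = sorted(range(len(records)), key=lambda i: str(records[i][key]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


def resolve(records, attrs, window=3, threshold=0.85):
    """One sorted pass per attribute, then keep pairs above the threshold."""
    pairs = set()
    for key in attrs:
        pairs |= candidate_pairs(records, key, window)
    return sorted(
        p for p in pairs
        if record_similarity(records[p[0]], records[p[1]], attrs) >= threshold
    )


# Hypothetical sensor records, invented for this example.
records = [
    {"name": "sensor-alpha", "temp": 21.5},
    {"name": "sensor-alpha", "temp": 21.5},
    {"name": "gateway-zulu", "temp": 80.0},
]
matches = resolve(records, ["name", "temp"])
```

Only records that fall within the same window after sorting are compared, which is how windowing reduces the number of pairwise comparisons relative to comparing every record with every other record. A varying window size, as proposed in the abstract, would replace the fixed `window` argument with a per-position rule; that rule is not specified here and is therefore left out of the sketch.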
