Probabilistic parallelisation of blocking non-matched records for big data

Blocking is a technique for filtering out pairs of records that are unlikely to match during record matching, the task of collecting all pairs of records that refer to the same entity across different data sources. Blocking has been widely adopted in data mining and databases. For big data, however, there is not yet a fast and effective blocking algorithm, because the number of candidate pairs between large data sets is enormous. In this paper, we report on a probabilistic parallelisation of a recently proposed blocking algorithm, a sequential algorithm designed for efficient record matching on a single machine. Our approach runs blocking processes in a distributed manner on partitioned input data. To reduce data exchange among these processes, we adopt a probabilistic technique that lets the processes run independently while ensuring that the aggregated result remains correct with respect to common metrics. Our experimental analysis supports the advantage of our technique and demonstrates its scalability on a Hadoop MapReduce system physically deployed in a cloud.
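As a rough illustration of the general idea of blocking only (not the probabilistic parallel algorithm reported in this paper), the sketch below groups records by a simple key and compares only records that share a block. The record layout, the `blocking_key` function and its three-letter surname prefix are all hypothetical choices made for this example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: the first three letters of the lowercased surname.
    return record["surname"][:3].lower()


def candidate_pairs(records):
    # Group record ids by blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    # Only pairs inside the same block become candidates for matching.
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Illustrative records, invented for this example.
records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smithe"},
    {"id": 3, "surname": "Jones"},
    {"id": 4, "surname": "Jonson"},
    {"id": 5, "surname": "Brown"},
]

pairs = candidate_pairs(records)
total_pairs = len(records) * (len(records) - 1) // 2
```

With five records there are ten possible pairs, but only pairs that share a key are kept for detailed comparison. The trade-off is that a key that is too strict can drop true matches, which is why blocking results are usually judged against both the reduction in pairs and the matches retained.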
