Multi-pass sorted neighborhood blocking with MapReduce

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution using Sorting Neighborhood blocking (SN). We propose and evaluate two efficient MapReduce-based implementations for single- and multi-pass SN that either use multiple MapReduce jobs or apply a tailored data replication. We also propose an automatic data partitioning approach for multi-pass SN to achieve load balancing. Our evaluation based on real-world datasets shows the high efficiency and effectiveness of the proposed approaches.

[1]  Randy H. Katz,et al.  Above the Clouds: A Berkeley View of Cloud Computing , 2009 .

[2]  J StolfoSalvatore,et al.  The merge/purge problem for large databases , 1995 .

[3]  David J. DeWitt,et al.  Practical Skew Handling in Parallel Joins , 1992, VLDB.

[4]  Peter Christen,et al.  A Comparison of Fast Blocking Methods for Record Linkage , 2003, KDD 2003.

[5]  Erhard Rahm,et al.  Data Partitioning for Parallel Entity Matching , 2010, ArXiv.

[6]  David J. DeWitt,et al.  Parallel database systems: the future of high performance database systems , 1992, CACM.

[7]  Randy H. Katz,et al.  A view of cloud computing , 2010, CACM.

[8]  Peter Christen,et al.  Febrl - A Parallel Open Source Data Linkage System: http://datamining.anu.edu.au/linkage.html , 2004, PAKDD.

[9]  Salvatore J. Stolfo,et al.  The merge/purge problem for large databases , 1995, SIGMOD '95.

[10]  Erhard Rahm,et al.  Frameworks for entity matching: A comparison , 2010, Data Knowl. Eng..

[11]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[12]  Dongwon Lee,et al.  Parallel linkage , 2007, CIKM '07.

[13]  Chen Li,et al.  Efficient parallel set-similarity joins using MapReduce , 2010, SIGMOD Conference.

[14]  Carlo Batini,et al.  Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications) , 2006 .

[15]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[16]  Andreas Thor,et al.  Evaluation of entity resolution approaches on real-world match problems , 2010, Proc. VLDB Endow..

[17]  Erhard Rahm,et al.  Data Cleaning: Problems and Current Approaches , 2000, IEEE Data Eng. Bull..

[18]  Andreas Thor,et al.  Parallel Sorted Neighborhood Blocking with MapReduce , 2011, BTW.

[19]  Peter Christen,et al.  Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface , 2008, KDD.

[20]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[21]  Andreas Thor,et al.  Learning-Based Approaches for Matching Web Data Entities , 2010, IEEE Internet Computing.