A fast approach for parallel deduplication on multicore processors

In this paper, we propose a fast approach that parallelizes the deduplication process on multicore processors. Our approach, named MD-Approach, combines an efficient blocking method with a robust data-parallel programming model. The blocking phase consists of two steps. The first step generates large blocks by grouping records that share only a low degree of similarity. The second step splits the large blocks, which could otherwise lead to an unbalanced load, into more precise sub-blocks. The data-parallel programming model is used to implement our approach as a sequence of map and reduce operations. An empirical evaluation shows that our deduplication approach is almost twice as fast as BTO-BK, a scalable parallel deduplication solution for distributed environments. To the best of our knowledge, MD-Approach is the first approach to focus on multicore processors for parallel deduplication.
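To make the two-step blocking idea concrete, the following is a minimal sketch and not the MD-Approach implementation. Everything in it is an assumption made for illustration: the blocking keys (`coarse_key`, `fine_key`), the block size limit `max_size`, the similarity threshold, and the use of `difflib.SequenceMatcher` as the matching function. A thread pool stands in for the multicore runtime; in CPython, a `ProcessPoolExecutor` would be needed for true CPU parallelism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def coarse_key(record):
    # Hypothetical loose key (step 1): first character of the name.
    return record[1][:1].lower()


def fine_key(record):
    # Hypothetical stricter key (step 2): initials of every word in the name.
    return "".join(word[0] for word in record[1].lower().split())


def group_by(records, key_fn):
    # Map each record to a key, then reduce by key into blocks.
    blocks = defaultdict(list)
    for record in records:
        blocks[key_fn(record)].append(record)
    return list(blocks.values())


def split_large(blocks, max_size):
    # Re-block only the oversized blocks so that work units stay balanced.
    result = []
    for block in blocks:
        if len(block) > max_size:
            result.extend(group_by(block, fine_key))
        else:
            result.append(block)
    return result


def match_block(block, threshold=0.8):
    # Compare every pair inside one block; return the ids of likely duplicates.
    return [
        (a[0], b[0])
        for a, b in combinations(block, 2)
        if SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio() >= threshold
    ]


def deduplicate(records, max_size=3, workers=4):
    blocks = split_large(group_by(records, coarse_key), max_size)
    # Map: match each block independently. Reduce: merge the per-block results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_block = list(pool.map(match_block, blocks))
    return sorted(pair for pairs in per_block for pair in pairs)


records = [
    (1, "John Smith"), (2, "Jon Smith"), (3, "Jane Doe"),
    (4, "Jane Do"), (5, "Jack Black"), (6, "Mary Major"),
]

if __name__ == "__main__":
    print(deduplicate(records))
```

In this toy input, the coarse key puts five records into a single block. The second step splits that block by initials, so each worker compares only a handful of pairs instead of all ten.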
