Efficient and exact duplicate detection on cloud

As the recent proliferation of social networks, mobile applications, and online services has increased the rate of data gathering, finding near-duplicate records efficiently has become a challenging issue. Related work on this problem mainly proposes efficient approaches for a single machine. However, when processing large-scale datasets, the performance of duplicate identification is still far from satisfactory. In this paper, we address the problem of duplicate detection using MapReduce. We argue that the performance of MapReduce-based duplicate detection depends mainly on the number of candidate record pairs and on the intermediate result size, which determines the shuffle cost among the nodes of the cluster. We propose a new signature scheme with new pruning strategies to minimize the number of candidate pairs and the intermediate result size. The proposed solution is exact: it guarantees that no duplicate record pair is lost. Experimental results over both real and synthetic datasets demonstrate that our signature-based method is efficient and scalable. Copyright © 2012 John Wiley & Sons, Ltd.
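
The abstract does not spell out the signature scheme, so the following is only a minimal sketch of the general idea behind signature-based duplicate detection in a MapReduce style. It swaps in prefix filtering, a standard technique from the set-similarity join literature, as the signature, and it is not the scheme proposed in the paper. The function names (`map_phase`, `reduce_phase`, `find_duplicates`), the Jaccard similarity measure, and the in-memory grouping that stands in for the shuffle are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def map_phase(records, threshold, freq):
    """Emit (signature token, record id) pairs.

    Tokens are sorted rarest first; only the first
    n - ceil(threshold * n) + 1 tokens (the prefix) are emitted.
    Two records with Jaccard >= threshold must share a prefix token,
    so no duplicate pair is lost (hypothetical stand-in signature).
    """
    for rid, tokens in records.items():
        ordered = sorted(tokens, key=lambda tok: (freq[tok], tok))
        n = len(ordered)
        # The small epsilon guards against float round-up in ceil.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        for tok in ordered[:prefix_len]:
            yield tok, rid


def reduce_phase(grouped, records, threshold):
    """Verify candidate pairs that share at least one signature token."""
    seen = set()
    for rids in grouped.values():
        for a, b in combinations(sorted(rids), 2):
            if (a, b) in seen:
                continue  # pair already verified under another signature
            seen.add((a, b))
            if jaccard(records[a], records[b]) >= threshold:
                yield a, b


def find_duplicates(records, threshold):
    """Run map, an in-memory stand-in for the shuffle, then reduce."""
    freq = defaultdict(int)
    for tokens in records.values():
        for tok in tokens:
            freq[tok] += 1
    grouped = defaultdict(list)  # simulates grouping by key across nodes
    for tok, rid in map_phase(records, threshold, freq):
        grouped[tok].append(rid)
    return sorted(reduce_phase(grouped, records, threshold))


if __name__ == "__main__":
    records = {
        1: {"a", "b", "c", "d"},
        2: {"a", "b", "c", "e"},
        3: {"x", "y", "z"},
    }
    print(find_duplicates(records, 0.5))
```

In this sketch, the number of emitted (token, record id) pairs plays the role of the intermediate result size, and the number of pairs verified in `reduce_phase` plays the role of the candidate pair count, the two quantities the abstract identifies as dominating cost.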
