论文信息 - MapDupReducer: detecting near duplicates over massive datasets

MapDupReducer: detecting near duplicates over massive datasets

Near duplicate detection benefits many applications, e.g., on-line news selection over the Web by keyword search. The purpose of this demo is to show the design and implementation of MapDupReducer, a MapReduce based system capable of detecting near duplicates over massive datasets efficiently.

[1] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[2] Joshua Alspector,et al. Improved robustness of signature-based near-replica detection via lexicon randomization , 2004, KDD.

[3] Jeffrey Xu Yu,et al. Efficient similarity joins for near-duplicate detection , 2011, TODS.

[4] Jimmy J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce , 2009, SIGIR.

[5] Ahmed K. Elmagarmid,et al. Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.