Paper Information - Hadoop Based Parallel Deduplication Method for Web Documents

This paper proposes a method for removing duplicate web pages using TF-IDF and a splay tree. Pages that may be duplicates of one another are first grouped according to keywords extracted with TextRank. The pages in each group are then compared using the TF-IDF and splay tree method. Three MapReduce jobs implement the TF-IDF calculation and the removal of duplicate pages. Experimental results show that the algorithm removes duplicate web pages efficiently and accurately.
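The pipeline described in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's implementation, and several parts are assumptions: the split into three passes follows one common way of computing TF-IDF with MapReduce (the abstract does not say how the three jobs are divided), the smoothed IDF formula and the cosine threshold of 0.9 are illustrative choices, keywords are supplied by the caller rather than computed with TextRank, and a plain list stands in for the splay tree. The names `tfidf_vectors`, `cosine`, and `find_duplicates` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs maps doc_id -> list of tokens; returns doc_id -> {term: weight}."""
    # Pass 1: term counts per document (map: ((doc, term), 1); reduce: sum).
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # Pass 2: number of documents that contain each term (group by term).
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    # Pass 3: combine term frequency with a smoothed IDF (assumed formula).
    n_docs = len(docs)
    vectors = {}
    for doc_id, term_counts in counts.items():
        total = sum(term_counts.values())
        vectors[doc_id] = {
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_duplicates(docs, keywords, threshold=0.9):
    """Return the ids of pages judged to duplicate an earlier page.

    keywords maps doc_id -> iterable of keywords (assumed to come from a
    keyword extractor such as TextRank; not computed here).
    """
    vectors = tfidf_vectors(docs)
    # Pages with the same keyword set are candidate duplicates.
    groups = defaultdict(list)
    for doc_id in docs:
        groups[frozenset(keywords[doc_id])].append(doc_id)
    duplicates = set()
    for members in groups.values():
        kept = []  # plain list here; the paper uses a splay tree
        for doc_id in members:
            if any(cosine(vectors[doc_id], vectors[k]) >= threshold for k in kept):
                duplicates.add(doc_id)
            else:
                kept.append(doc_id)
    return duplicates
```

Grouping by keywords first means pairwise comparison happens only inside each candidate group, which is what makes the per-group work easy to distribute across reducers.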

Yuhui Zheng | Junjie Song | Jin Liu

[1] Wang Jian. Research and Evaluation of Near replicas of Web Pages Detection Algorithms, 2000.

[2] Geoffrey Zweig, et al. Syntactic Clustering of the Web, 1997, Comput. Networks.

[3] Robert E. Tarjan, et al. Self-adjusting binary search trees, 1985, JACM.

[4] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.

[5] Daniel P. Lopresti, et al. Models and algorithms for duplicate document detection, 1999, Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR '99 (Cat. No.PR00318).

[6] Gerard Salton, et al. Term-Weighting Approaches in Automatic Text Retrieval, 1988, Inf. Process. Manag.

[7] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[8] Qian Song-rong. Duplicate Web Page Elimination Based on HTML and Extraction of Long Sentence, 2009.

[9] Rada Mihalcea, et al. TextRank: Bringing Order into Text, 2004, EMNLP.

[10] Xianghua Xu, et al. Design and Implement of Distributed Document Clustering Based on MapReduce, 2009.

[11] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[12] Edward A. Fox, et al. Research Contributions, 2014.