Hadoop Based Parallel Deduplication Method for Web Documents

This paper proposes a method of deleting duplicate web pages through tf-idf and splay tree. According to the keywords which are extracted by TextRank, those pages which may be duplicate copies will be sent to a group. Then these pages will be judged by the method above. We use three Map-Reduce tasks to ensure the method of calculating tf-idf and deleting duplicate web pages. The experiment result shows that the algorithm can remove duplicate web pages efficiently and accurately.