A Parallelized Algorithm for Eliminating Duplicated Web Pages Based on Map/Reduce

The duplicate-elimination module, which filters the web pages downloaded by the crawler and discards duplicated pages, is an important part of a search engine. It improves the performance of the crawler module and the quality of a search engine's results. This paper proposes an algorithm for eliminating duplicated web pages together with a parallelization strategy based on Map/Reduce. Experiments on a real web site demonstrate the algorithm's stability and parallel performance when processing large-scale collections of web pages.
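One common way to realize such a Map/Reduce-based strategy is a single job in which the map phase computes a fingerprint for each downloaded page and the reduce phase keeps exactly one page per fingerprint. The Hadoop-style sketch below illustrates this idea; the class names (DedupMapper, DedupReducer), the tab-separated "URL, content" input format, and the MD5 content hash are illustrative assumptions, not the paper's exact design.

```java
import java.io.IOException;
import java.security.MessageDigest;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (fingerprint of page content, page URL) for every crawled page.
// Input is assumed to be one record per page in the form "URL \t page content".
class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] parts = record.toString().split("\t", 2);
        if (parts.length < 2) {
            return; // skip malformed records
        }
        String url = parts[0];
        String content = parts[1];
        try {
            // MD5 of the page content serves as the duplicate fingerprint
            // (an assumption; other fingerprinting schemes would work the same way).
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            context.write(new Text(hex.toString()), new Text(url));
        } catch (Exception e) {
            // ignore pages whose fingerprint cannot be computed
        }
    }
}

// Reduce phase: all URLs sharing a fingerprint arrive at the same reducer;
// keep only the first, so one representative page survives per group of duplicates.
class DedupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text fingerprint, Iterable<Text> urls, Context context)
            throws IOException, InterruptedException {
        context.write(fingerprint, urls.iterator().next());
    }
}
```

The grouping-by-fingerprint step is what makes the elimination parallelizable: the shuffle phase routes every copy of an identical page to the same reducer, so duplicates can be detected without any global data structure shared across nodes.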