An FW-BF Based Approach on Elimination of Duplicated Web Pages

With the rapid development of social networks, the Internet has become the most widely used information source. However, it contains a large number of duplicated web pages, most of which result from reprinting. Broder et al. conducted an experiment on a collection of 30,000,000 HTML and text documents and found that nearly 18% of the pages are exactly the same and 41% of the pages share 51% similarity. These replicas place a major burden on search engines and severely degrade their performance. As a result, the elimination of duplicated web pages has become a hot topic in information retrieval in recent years. In this paper, we propose a function word (FW) based approach that uses the Bloom filter (BF) to eliminate duplicated web pages without extracting the main text of each page. Our approach involves three stages. In stage 1, sample text is extracted according to the function-word features of the web pages. In stage 2, a feature code is extracted using function words. In stage 3, duplicated web pages are eliminated by calculating the similarity of their Bloom filters.
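
The three stages can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper, and every concrete choice in it is an assumption made for illustration: the short English function-word list, the feature definition (a function word plus the next few tokens), the filter size, the SHA-256 based double hashing, and the bitwise Jaccard similarity. The names `FUNCTION_WORDS`, `extract_features`, `page_filter`, and `similarity` are hypothetical.

```python
import hashlib
import re

# Hypothetical function-word list (assumption; not the list used in the paper).
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "with"}


def extract_features(text, window=3):
    """Stages 1-2 (assumed form): one feature per function-word occurrence,
    made of the function word and up to `window` tokens that follow it."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        " ".join(tokens[i:i + 1 + window])
        for i, tok in enumerate(tokens)
        if tok in FUNCTION_WORDS
    ]


class BloomFilter:
    """Fixed-size Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def page_filter(text, num_bits=1024, num_hashes=3):
    """Build the Bloom filter of a page from its function-word features."""
    bf = BloomFilter(num_bits, num_hashes)
    for feature in extract_features(text):
        bf.add(feature)
    return bf


def similarity(a, b):
    """Stage 3 (assumed measure): Jaccard similarity of the set bits.
    Both filters must share the same num_bits and num_hashes."""
    union = bin(a.bits | b.bits).count("1")
    if union == 0:
        return 0.0  # neither page produced any feature
    return bin(a.bits & b.bits).count("1") / union
```

Under these assumptions, two pages would be treated as duplicates when `similarity(page_filter(p1), page_filter(p2))` exceeds a chosen threshold. Because the features are anchored on function words in the raw page text, the sketch never needs to isolate the main text first.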