With the rapid growth of social networks, the Internet has become the most widely used source of information. However, it contains a large number of duplicated web pages, most of which result from reprinting. Broder et al. ran an experiment on a collection of 30,000,000 HTML and text documents and found that nearly 18% of the pages were exact duplicates and 41% of the pages shared 51% similarity. These replicated pages place a major burden on search engines and badly degrade their performance, so eliminating duplicated web pages has become an active topic in information retrieval in recent years. In this paper, we propose a function word (FW) based approach that uses the Bloom filter (BF) to eliminate duplicated web pages without extracting the main text of the page. Our approach has three separate stages. In stage 1, sample text is extracted from each web page according to its function-word features. In stage 2, the feature code is extracted using function words. In stage 3, duplicated web pages are eliminated by computing the similarity of their Bloom filters.
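The three stages can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the method as specified in the paper: the function-word list, the form of the feature code (a function word plus the two tokens that follow it), the filter size, the number of hash functions, and the Jaccard-style bit similarity are all hypothetical choices made here so that the example is runnable.

```python
import hashlib
import re

# Hypothetical function-word list (assumption; the abstract does not give one).
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "is", "that"}

M_BITS = 1024   # assumed Bloom filter size in bits
K_HASHES = 3    # assumed number of hash functions


def feature_codes(text, width=2):
    """Stages 1-2 (assumed form): for each function word, emit the word
    together with the `width` tokens that follow it."""
    tokens = re.findall(r"\w+", text.lower())
    codes = []
    for i, tok in enumerate(tokens):
        if tok in FUNCTION_WORDS:
            codes.append(" ".join(tokens[i:i + 1 + width]))
    return codes


def bloom_filter(codes, m=M_BITS, k=K_HASHES):
    """Insert every feature code into a Bloom filter stored as an int bitmask."""
    bits = 0
    for code in codes:
        for seed in range(k):
            # Derive k bit positions by salting SHA-256 with the seed.
            digest = hashlib.sha256(f"{seed}:{code}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % m)
    return bits


def similarity(a, b):
    """Stage 3 (assumed measure): Jaccard similarity over the set bits."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # two empty filters are treated as identical
    return bin(a & b).count("1") / union
```

Two pages would then be treated as duplicates when `similarity(bloom_filter(feature_codes(a)), bloom_filter(feature_codes(b)))` exceeds a chosen threshold; the threshold itself is another parameter this sketch leaves open.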