A novel web page duplication detection framework

The Internet contains many redundant web pages. In this paper, we present a novel multilayer framework for detecting duplicated web pages, based on tag statistics and text similarity comparison. We propose two algorithms for detecting similar text paragraphs and implement the framework. The experimental results show that our approach achieves high performance, which indicates that duplicated web pages can be detected efficiently using only tag statistics and text comparison.
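The abstract does not specify the framework's layers or its two paragraph-detection algorithms, so the following is only a minimal sketch of the general idea under stated assumptions: a cheap tag-statistics check runs first, and the more expensive text comparison runs only when the tag profiles agree. The function names, the cosine measure on tag counts, the use of `difflib.SequenceMatcher` for text similarity, and both thresholds are illustrative choices, not the authors' method.

```python
from collections import Counter
from difflib import SequenceMatcher
from html.parser import HTMLParser
import math


class _PageProfile(HTMLParser):
    """Collects a tag-frequency histogram and the visible text chunks of a page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.paragraphs = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not visible

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip:
            self.paragraphs.append(text)


def profile(html):
    """Return (tag histogram, list of text chunks) for an HTML string."""
    parser = _PageProfile()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.paragraphs


def tag_similarity(a, b):
    """Cosine similarity of two tag-count histograms (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def text_similarity(paragraphs_a, paragraphs_b):
    """Character-level similarity ratio of the concatenated text (assumed measure)."""
    return SequenceMatcher(
        None, "\n".join(paragraphs_a), "\n".join(paragraphs_b)
    ).ratio()


def is_duplicate(html_a, html_b, tag_threshold=0.9, text_threshold=0.8):
    """Layer 1: tag statistics. Layer 2: text comparison. Thresholds are arbitrary."""
    tags_a, text_a = profile(html_a)
    tags_b, text_b = profile(html_b)
    if tag_similarity(tags_a, tags_b) < tag_threshold:
        return False  # structurally different pages are rejected cheaply
    return text_similarity(text_a, text_b) >= text_threshold
```

Ordering the checks this way means most non-duplicate pairs are discarded by comparing small tag histograms, which is one plausible reading of why a layered design would be efficient.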
