A novel web page duplication detection framework
[1] Hector Garcia-Molina,et al. SCAM: A Copy Detection Mechanism for Digital Documents , 1995, DL.
[2] Peter Willett,et al. Identification of duplicate and near‐duplicate full‐text records in database search‐outputs using hierarchic cluster analysis , 1995 .
[3] Hector Garcia-Molina,et al. Finding replicated Web collections , 2000, SIGMOD '00.
[4] Andrei Z. Broder,et al. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content , 1999, Comput. Networks.
[5] Mo Qian. Research on methods for extracting text information from HTML pages , 2008 .
[6] Monika Henzinger,et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.
[7] Hassan Artail,et al. A fast HTML web page change detection approach based on hashing and reducing the number of similarity computations , 2008, Data Knowl. Eng.
[8] Xiaoli Li,et al. Learning to Classify Texts Using Positive and Unlabeled Data , 2003, IJCAI.
[9] Claire Cardie,et al. The Smart/Empire TIPSTER IR System , 1998, TIPSTER.
[10] Wei Li,et al. Web document duplicate removal algorithm based on keyword sequences , 2005, 2005 International Conference on Natural Language Processing and Knowledge Engineering.
[11] Jan-Ming Ho,et al. Discovering informative content blocks from Web documents , 2002, KDD.
[12] Hector Garcia-Molina,et al. The SIFT information dissemination system , 1999, TODS.
[13] J.-H. Park,et al. Dynamic management of URL based on object-oriented paradigm , 1998, Proceedings 1998 International Conference on Parallel and Distributed Systems.
[14] Udi Manber,et al. Finding Similar Files in a Large File System , 1994, USENIX Winter.
[15] Chia-Hui Chang,et al. Automatic Information Extraction for Multiple Singular Web Pages , 2002, PAKDD.
[16] Panagiotis G. Ipeirotis,et al. Automatic Extraction of Useful Facet Hierarchies from Text Databases , 2008, 2008 IEEE 24th International Conference on Data Engineering.
[17] Li Xiao. Two Effective Functions on Hashing URL , 2004 .
[18] Marc Najork,et al. On the evolution of clusters of near-duplicate Web pages , 2003, Proceedings of the First Latin American Web Congress (LA-WEB 2003).
[19] Qian Mo,et al. Effectively and efficiently detect web page duplication , 2009, 2009 Fourth International Conference on Digital Information Management.
[20] Wolfgang Gatterbauer,et al. Using visual cues for extraction of tabular data from arbitrary HTML documents , 2005, WWW '05.
[21] Larry Spitz,et al. Duplicate document detection , 1997, Electronic Imaging.