URL normalization for de-duplication of web pages
Hema Swetha Koppula | Sachin Garg | Amit Agarwal | Anirban Roy | Amit Sasturkar | Krishna P. Leela | Krishna Prasad Chitrapura | GM PavanKumar | Chittaranjan Haty
[1] Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).
[2] Gurmeet Singh Manku, et al. Detecting near-duplicates for web crawling, 2007, WWW '07.
[3] Idit Keidar, et al. Do not crawl in the DUST: different URLs with similar text, 2006, WWW.
[4] Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.
[5] Tim Berners-Lee, et al. Uniform Resource Locators (URL), 1994, RFC.
[6] Monika Henzinger, et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms, 2006, SIGIR.
[7] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.
[8] Anirban Dasgupta, et al. De-duping URLs via rewrite rules, 2008, KDD.
[9] Serge Abiteboul, et al. Adaptive on-line page importance computation, 2003, WWW '03.
[10] J. Ross Quinlan, et al. Induction of Decision Trees, 1986, Machine Learning.
[11] Idit Keidar, et al. Do not crawl in the dust: different urls with similar text, 2006, WWW '07.
[12] Marc Najork, et al. On the evolution of clusters of near-duplicate Web pages, 2003, Proceedings of the First Latin American Web Congress (LA-WEB 2003).
[13] Gene H. Golub, et al. Exploiting the Block Structure of the Web for Computing, 2003.