Learning URL patterns for webpage de-duplication
Hema Swetha Koppula | Sachin Garg | Amit Agarwal | Amit Sasturkar | Krishna P. Leela | Krishna Prasad Chitrapura
[1] Tim Berners-Lee, et al. Uniform Resource Locators (URL), 1994, RFC.
[2] Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.
[3] Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).
[4] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.
[5] Moses Charikar, et al. Similarity estimation techniques from rounding algorithms, 2002, STOC '02.
[6] Marc Najork, et al. On the evolution of clusters of near-duplicate Web pages, 2003, Proceedings of the First Latin American Web Congress (LA-WEB 2003).
[7] Gene H. Golub, et al. Exploiting the Block Structure of the Web for Computing PageRank, 2003.
[8] Serge Abiteboul, et al. Adaptive on-line page importance computation, 2003, WWW '03.
[9] Steve Lawrence, et al. Extracting knowledge from the World Wide Web, 2004, Proceedings of the National Academy of Sciences of the United States of America.
[10] J. Ross Quinlan, et al. Induction of Decision Trees, 1986, Machine Learning.
[11] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.
[12] Min-Yen Kan, et al. Fast webpage classification using URL features, 2005, CIKM '05.
[13] Monika Henzinger, et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms, 2006, SIGIR.
[14] I. Keidar, et al. Do not crawl in the DUST: Different URLs with similar text, 2006, TWEB.
[15] Idit Keidar, et al. Do not crawl in the DUST: different URLs with similar text, 2006, WWW.
[16] Gurmeet Singh Manku, et al. Detecting near-duplicates for web crawling, 2007, WWW '07.
[17] Anirban Dasgupta, et al. De-duping URLs via rewrite rules, 2008, KDD.
[18] Monika Henzinger, et al. Purely URL-based topic classification, 2009, WWW '09.
[19] Hema Swetha Koppula, et al. URL normalization for de-duplication of web pages, 2009, CIKM.