Learning URL patterns for webpage de-duplication

The presence of duplicate documents in the World Wide Web adversely affects crawling, indexing and relevance, which are the core building blocks of web search. In this paper, we present a set of techniques to mine rules from URLs and use these rules for de-duplication based on URL strings alone, without fetching the page content. Our technique mines crawl logs and uses clusters of similar pages to extract transformation rules, which are then used to normalize the URLs belonging to each cluster. Retaining every mined rule for de-duplication is inefficient because the rules are so numerous. We therefore present a machine learning technique to generalize the set of rules, reducing the resource footprint enough to be usable at web scale. The rule extraction techniques are robust against web-site specific URL conventions. We compare the precision and scalability of our approach with recent efforts in using URLs for de-duplication. Experimental results demonstrate that our approach achieves twice the reduction in duplicates with only half as many rules as the most recent previous approach. We demonstrate the scalability of the framework with a large-scale evaluation on a set of 3 billion URLs, implemented using the MapReduce framework.
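
To make the idea of URL-only normalization concrete, the following is a minimal sketch and not the rule-mining algorithm of the paper. It assumes a much simpler rule form: given a cluster of URLs already known to be duplicates, any query parameter whose value differs within the cluster is treated as irrelevant and dropped. The function names (`irrelevant_params`, `normalize`, `dedupe`) and the `example.com` URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def irrelevant_params(cluster):
    """Query keys whose values differ across URLs known to be duplicates."""
    parsed = [dict(parse_qsl(urlsplit(u).query, keep_blank_values=True))
              for u in cluster]
    keys = set().union(*parsed)
    # A key is irrelevant if the duplicates disagree on it (or some omit it).
    return {k for k in keys if len({p.get(k) for p in parsed}) > 1}


def normalize(url, drop):
    """Rewrite a URL by removing the dropped keys and sorting the rest."""
    parts = urlsplit(url)
    kept = sorted((k, v)
                  for k, v in parse_qsl(parts.query, keep_blank_values=True)
                  if k not in drop)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))


def dedupe(urls, drop):
    """Group URLs by normalized form; no page content is fetched."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url, drop)].append(url)
    return dict(groups)


# Hypothetical duplicate cluster: same story, different session ids.
cluster = [
    "http://example.com/story?id=42&sid=abc",
    "http://example.com/story?sid=xyz&id=42",
]
drop = irrelevant_params(cluster)
unseen = [
    "http://example.com/story?id=42&sid=q1",
    "http://example.com/story?sid=q2&id=42",
    "http://example.com/story?id=7&sid=q1",
]
groups = dedupe(unseen, drop)
```

Here the rule learned from one cluster collapses the first two unseen URLs into a single normalized form while keeping the third distinct. Storing one such rule per cluster is what becomes impractical at scale, which is the motivation the abstract gives for generalizing rules rather than keeping each one.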
