Do Not Crawl in the DUST: Different URLs with Similar Text

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from many different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
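To illustrate the kind of rule DustBuster discovers, here is a minimal sketch in which a DUST rule is modeled as a substring substitution (α → β) applied to a URL. The rule representation and the example URLs are illustrative assumptions, not the paper's exact data structures.

```python
def apply_dust_rule(url: str, rule: tuple[str, str]) -> str:
    """Apply a DUST-style substitution rule (alpha -> beta):
    replace the first occurrence of substring alpha with beta.
    This tuple-based rule format is an illustrative assumption."""
    alpha, beta = rule
    return url.replace(alpha, beta, 1)

# Hypothetical rule: "/index.html" is often equivalent to "/"
rule = ("/index.html", "/")
canonical = apply_dust_rule("http://example.com/news/index.html", rule)
print(canonical)  # → http://example.com/news/
```

In practice, a crawler would apply verified rules of this form to canonicalize URLs before fetching, skipping any URL whose canonical form has already been crawled.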
