Do not crawl in the DUST: different URLs with similar text
[1] E. Harder,et al. Apache , 1965 .
[2] David S. Johnson,et al. Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .
[3] Ronald L. Rivest,et al. The MD5 Message-Digest Algorithm , 1992, RFC.
[4] R. Agrawal. Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.
[5] Hector Garcia-Molina,et al. Copy detection mechanisms for digital documents , 1995, SIGMOD '95.
[6] Luis Gravano,et al. dSCAM: finding document copies across multiple databases , 1996, Fourth International Conference on Parallel and Distributed Information Systems.
[7] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .
[8] T. Landauer,et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. , 1997 .
[9] David W. Conrath,et al. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.
[10] Dan Gusfield,et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .
[11] Anja Feldmann,et al. Rate of Change and other Metrics: a Live Study of the World Wide Web , 1997, USENIX Symposium on Internet Technologies and Systems.
[12] Geoffrey Zweig,et al. Syntactic Clustering of the Web , 1997, Comput. Networks.
[13] Sergey Brin,et al. The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.
[14] Hector Garcia-Molina,et al. Finding Near-Replicas of Documents and Servers on the Web , 1998, WebDB.
[15] Roy T. Fielding,et al. Uniform Resource Identifiers (URI): Generic Syntax , 1998, RFC.
[16] Alistair Moffat,et al. Exploring the similarity space , 1998, SIGF.
[17] Rüdiger Reischuk,et al. Learning one-variable pattern languages in linear average time , 1997, COLT' 98.
[18] Dekang Lin,et al. Automatic Retrieval and Clustering of Similar Words , 1998, ACL.
[19] Mark Levene,et al. Data Mining of User Navigation Patterns , 1999, WEBKDD.
[20] Andrei Z. Broder,et al. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content , 1999, Comput. Networks.
[21] Hector Garcia-Molina,et al. Finding replicated Web collections , 2000, SIGMOD '00.
[22] Andrei Z. Broder,et al. A Comparison of Techniques to Find Mirrored Hosts on the WWW , 2000, IEEE Data Eng. Bull..
[23] Colin de la Higuera,et al. Current Trends in Grammatical Inference , 2000, SSPR/SPR.
[24] Alvin S. Lim,et al. A URL-String-Based Algorithm for Finding WWW Mirror Hosts , 2001 .
[25] Peter D. Turney. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL , 2001, ECML.
[26] Arkady B. Zaslavsky,et al. Signature Extraction for Overlap Detection in Documents , 2002, ACSC.
[27] Máté Pataki,et al. Comparison of Overlap Detection Techniques , 2002, International Conference on Computational Science.
[28] Thorsten Joachims,et al. Optimizing search engines using clickthrough data , 2002, KDD.
[29] Terence Kelly,et al. Aliasing on the world wide web: prevalence and performance implications , 2002, WWW '02.
[30] Marco Gori,et al. Detecting near-replicas on the Web by content and hyperlink analysis , 2003, Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003).
[31] Justin Zobel,et al. Methods for Identifying Versioned and Plagiarized Documents , 2003, J. Assoc. Inf. Sci. Technol..
[32] Michael Dahlin,et al. Using Bloom Filters to Refine Web Search Results , 2005, WebDB.
[33] Sang Ho Lee,et al. Reliable Evaluations of URL Normalization , 2006, ICCSA.
[34] Idit Keidar,et al. Do not crawl in the DUST: different URLs with similar text , 2006, WWW.
[35] Michael L. Nelson,et al. Evaluation of crawling policies for a web-repository crawler , 2006, HYPERTEXT '06.