Detecting near-duplicates for web crawling
[1] David A. Huffman,et al. A method for the construction of minimum-redundancy codes , 1952, Proceedings of the IRE.
[2] Andrew Chi-Chih Yao,et al. The complexity of searching an ordered random table , 1976, 17th Annual Symposium on Foundations of Computer Science (sfcs 1976).
[3] T. Landauer,et al. Indexing by Latent Semantic Analysis , 1990 .
[4] Brenda S. Baker,et al. A theory of parameterized pattern matching: algorithms and applications , 1993, STOC.
[5] F. Frances Yao,et al. Multi-index hashing for information retrieval , 1994, Proceedings 35th Annual Symposium on Foundations of Computer Science.
[6] Noam Nisan,et al. Neighborhood preserving hashing and approximate queries , 1994, SODA '94.
[7] Udi Manber,et al. Finding Similar Files in a Large File System , 1994, USENIX Winter.
[8] Brenda S. Baker,et al. On finding duplication and near-duplication in large software systems , 1995, Proceedings of 2nd Working Conference on Reverse Engineering.
[9] Hector Garcia-Molina,et al. SCAM: A Copy Detection Mechanism for Digital Documents , 1995, DL.
[10] Hector Garcia-Molina,et al. Copy detection mechanisms for digital documents , 1995, SIGMOD '95.
[11] Leszek Gasieniec,et al. Approximate Dictionary Queries , 1996, CPM.
[12] Andrei Z. Broder,et al. On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).
[13] Andrew Chi-Chih Yao,et al. Dictionary Look-Up with One Error , 1997, J. Algorithms.
[14] Geoffrey Zweig,et al. Syntactic Clustering of the Web , 1997, Comput. Networks.
[15] Sergey Brin,et al. The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.
[16] Hector Garcia-Molina,et al. Efficient Crawling Through URL Ordering , 1998, Comput. Networks.
[17] Alan M. Frieze,et al. Min-wise independent permutations (extended abstract) , 1998, STOC '98.
[18] Jon M. Kleinberg,et al. Authoritative sources in a hyperlinked environment , 1999 .
[19] Gerth Stølting Brodal,et al. Improved Bounds for Dictionary Look-up with One Error , 1999 .
[20] Andrei Z. Broder,et al. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content , 1999, Comput. Networks.
[21] Monika Henzinger,et al. Finding Related Pages in the World Wide Web , 1999, Comput. Networks.
[22] Ravi Kumar,et al. Trawling the Web for Emerging Cyber-Communities , 1999, Comput. Networks.
[23] Andrei Z. Broder,et al. A Comparison of Techniques to Find Mirrored Hosts on the WWW , 2000, IEEE Data Eng. Bull..
[24] S. Muthukrishnan,et al. Selectivity estimation for Boolean queries , 2000, PODS '00.
[25] Alan M. Frieze,et al. Min-Wise Independent Permutations , 2000, J. Comput. Syst. Sci..
[26] Piotr Indyk,et al. Scalable Techniques for Clustering the Web , 2000, WebDB.
[27] Edith Cohen,et al. Finding interesting associations without support pruning , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).
[28] Marco Gori,et al. Focused Crawling Using Context Graphs , 2000, VLDB.
[29] David Mazières,et al. A low-bandwidth network file system , 2001 .
[30] Sriram Raghavan,et al. Searching the Web , 2001, ACM Trans. Internet Techn..
[31] Dimitrios Gunopulos,et al. Efficient and tunable similar set retrieval , 2001, SIGMOD '01.
[32] Filippo Menczer,et al. Evaluating topic-driven web crawlers , 2001, SIGIR '01.
[33] Moses Charikar,et al. Similarity estimation techniques from rounding algorithms , 2002, STOC '02.
[34] James W. Cooper,et al. Detecting similar documents using salient terms , 2002, CIKM '02.
[35] Sean Quinlan,et al. Venti: A New Approach to Archival Storage , 2002, FAST.
[36] Ömer Egecioglu,et al. Dictionary Look-Up within Small Edit Distance , 2002, COCOON.
[37] Ophir Frieder,et al. Collection statistics for fast duplicate document detection , 2002, TOIS.
[38] Dan Klein,et al. Evaluating strategies for similarity search on the web , 2002, WWW '02.
[39] Sachindra Joshi,et al. A bag of paths model for measuring structural similarity in Web documents , 2003, KDD '03.
[40] Daniel Shawcross Wilkerson,et al. Winnowing: local algorithms for document fingerprinting , 2003, SIGMOD '03.
[41] Hector Garcia-Molina,et al. Extracting structured data from Web pages , 2003, SIGMOD '03.
[42] Justin Zobel,et al. Methods for Identifying Versioned and Plagiarized Documents , 2003, J. Assoc. Inf. Sci. Technol..
[43] Sanjay Ghemawat,et al. The Google file system , 2003 .
[44] Andrei Z. Broder,et al. Efficient URL caching for world wide web crawling , 2003, WWW '03.
[45] Mohamed S. Kamel,et al. Efficient phrase-based document indexing for Web document clustering , 2004, IEEE Transactions on Knowledge and Data Engineering.
[46] Chaomei Chen,et al. Mining the Web: Discovering knowledge from hypertext data , 2004, J. Assoc. Inf. Sci. Technol..
[47] Laurel Howe. Mirror , 2004 .
[48] Joshua Alspector,et al. Improved robustness of signature-based near-replica detection via lexicon randomization , 2004, KDD.
[49] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.
[50] Jack G. Conrad,et al. Constructing a text corpus for inexact duplicate detection , 2004, SIGIR '04.
[51] Sandeep Pandey,et al. User-centric Web crawling , 2005, WWW '05.
[52] Vladik Kreinovich,et al. Mining the Web: Discovering Knowledge from Hypertext Data, by Soumen Chakrabarti and Morgan Kaufmann , 2005, J. Intell. Fuzzy Syst..
[53] Monika Henzinger,et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.