Efficient partial-duplicate detection based on sequence matching
Xuanjing Huang | Yue Zhang | Qi Zhang | Haomin Yu
[1] Andrei Z. Broder,et al. On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997.
[2] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[3] Lynette Hirschman,et al. MITRE: Description of the Alembic System Used for MUC-6 , 1995, MUC.
[4] Joshua Alspector,et al. Improved robustness of signature-based near-replica detection via lexicon randomization , 2004, KDD.
[5] Piotr Indyk,et al. Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.
[6] Ricardo Baeza-Yates,et al. Information Retrieval: Data Structures and Algorithms , 1992 .
[7] Andrei Z. Broder,et al. Identifying and Filtering Near-Duplicate Documents , 2000, CPM.
[8] Alexander Löser,et al. Near-duplicate detection for web-forums , 2009, IDEAS '09.
[9] Geoffrey Zweig,et al. Syntactic Clustering of the Web , 1997, Comput. Networks.
[10] Venkata Subramaniam,et al. Information Retrieval: Data Structures & Algorithms , 1992 .
[11] Hector Garcia-Molina,et al. Finding Near-Replicas of Documents and Servers on the Web , 1998, WebDB.
[12] Daniel Shawcross Wilkerson,et al. Winnowing: local algorithms for document fingerprinting , 2003, SIGMOD '03.
[13] Andreas Paepcke,et al. SpotSigs: robust and efficient near duplicate detection in large web collections , 2008, SIGIR '08.
[14] Sanjay Ghemawat,et al. The Google file system , 2003 .
[15] Andrew Tomkins,et al. Toward a PeopleWeb , 2007, Computer.
[16] Ophir Frieder,et al. Collection statistics for fast duplicate document detection , 2002, TOIS.
[17] Bill N. Schilit,et al. Generating links by mining quotations , 2008, Hypertext.
[18] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.
[19] Monika Henzinger,et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.
[20] W. Bruce Croft,et al. Local text reuse detection , 2008, SIGIR '08.
[21] Inna Kouper,et al. Longitudinal Content Analysis of Blogs: 2003–2004 , 2012 .
[22] W. Bruce Croft,et al. Finding text reuse on the web , 2009, WSDM '09.
[23] Hector Garcia-Molina,et al. SCAM: A Copy Detection Mechanism for Digital Documents , 1995, DL.
[24] Piotr Indyk,et al. Similarity Search in High Dimensions via Hashing , 1999, VLDB.
[25] William Gropp,et al. Using MPI: portable parallel programming with the message-passing interface , 1994 .
[26] Gurmeet Singh Manku,et al. Detecting near-duplicates for web crawling , 2007, WWW '07.
[27] Moses Charikar,et al. Similarity estimation techniques from rounding algorithms , 2002, STOC '02.
[28] Jimmy J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce , 2009, SIGIR.