Detecting near-duplicate documents using sentence-level features and supervised learning

Yung-Shen Lin | Ting-Yi Liao | Shie-Jue Lee

[1] Raghav Kaushik, et al. Efficient exact set-similarity joins, 2006, VLDB.

[2] Bin Wang, et al. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams, 2007, VLDB.

[3] Andreas Paepcke, et al. SpotSigs: robust and efficient near duplicate detection in large web collections, 2008, SIGIR '08.

[4] Ronald Fagin, et al. Efficient similarity search and classification via rank aggregation, 2003, SIGMOD '03.

[5] Xuemin Lin, et al. SPARK2: Top-k Keyword Query in Relational Databases, 2007, IEEE Transactions on Knowledge and Data Engineering.

[6] Gurmeet Singh Manku, et al. Detecting near-duplicates for web crawling, 2007, WWW '07.

[7] Marc Najork, et al. On the evolution of clusters of near-duplicate Web pages, 2003, Proceedings of the First Latin American Web Congress (LA-WEB 2003).

[8] Ricardo A. Baeza-Yates, et al. Where and How Duplicates Occur in the Web, 2006, 2006 Fourth Latin American Web Congress.

[9] Lei Wang, et al. On Similarity Preserving Feature Selection, 2013, IEEE Transactions on Knowledge and Data Engineering.

[10] Monika Henzinger, et al. Finding near-duplicate web pages: a large-scale evaluation of algorithms, 2006, SIGIR.

[11] Christopher D. Manning, et al. Introduction to Information Retrieval, 2010, J. Assoc. Inf. Sci. Technol.

[12] Marcos André Gonçalves, et al. A Genetic Programming Approach to Record Deduplication, 2012, IEEE Transactions on Knowledge and Data Engineering.

[13] Sanjay Ghemawat, et al. MapReduce: simplified data processing on large clusters, 2008, CACM.

[14] Tetsuya Sakai, et al. Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology, 2009.

[15] T. Martin McGinnity, et al. A Context-Based Word Indexing Model for Document Summarization, 2013, IEEE Transactions on Knowledge and Data Engineering.

[16] Timothy W. Finin, et al. Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy, 2013, IEEE Transactions on Knowledge and Data Engineering.

[17] Wen-tau Yih, et al. Adaptive near-duplicate detection via similarity learning, 2010, SIGIR.

[18] 魏小敏, et al. Research on the Bag of Words Algorithm Framework, 2011.

[19] Jeffrey Xu Yu, et al. Efficient similarity joins for near-duplicate detection, 2011, TODS.

[20] Jongik Kim, et al. Efficient Exact Similarity Searches Using Multiple Token Orderings, 2012, 2012 IEEE 28th International Conference on Data Engineering.

[21] Bruno Martins. A Supervised Machine Learning Approach for Duplicate Detection over Gazetteer Records, 2011, GeoS.

[22] Xuemin Lin, et al. Top-k Set Similarity Joins, 2009, 2009 IEEE 25th International Conference on Data Engineering.

[23] Jenq-Haur Wang, et al. Exploiting Sentence-Level Features for Near-Duplicate Document Detection, 2009, AIRS.

[24] Xueqi Cheng, et al. Detecting Near-Duplicates in Large-Scale Short Text Databases, 2008, PAKDD.

[25] Jugal K. Kalita, et al. Cutting Plane Training for Linear Support Vector Machines, 2013, IEEE Transactions on Knowledge and Data Engineering.

[26] Grace Hui Yang, et al. Near-duplicate detection by instance-level constrained clustering, 2006, SIGIR.

[27] Junping Qiu, et al. Detection and optimized disposal of near-duplicate pages, 2010, 2010 2nd International Conference on Future Computer and Communication.

[28] Hector Garcia-Molina, et al. Copy detection mechanisms for digital documents, 1995, SIGMOD '95.

[29] Maosong Sun, et al. Semi-Supervised SimHash for Efficient Document Similarity Search, 2011, ACL.

[30] Barbara Plank, et al. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.

[31] A. Govardhan, et al. A Novel and Efficient Approach For Near Duplicate Page Detection in Web Crawling, 2009, 2009 IEEE International Advance Computing Conference.

[32] Paolo Rosso, et al. Detection of near-duplicate user generated contents: the SMS spam collection, 2011, SMUC '11.

[33] Lei Wang, et al. Achieving both high precision and high recall in near-duplicate detection, 2008, CIKM '08.

[34] Andrei Z. Broder, et al. Identifying and Filtering Near-Duplicate Documents, 2000, CPM.

[35] Dmitri Loguinov, et al. Probabilistic near-duplicate detection using simhash, 2011, CIKM '11.

[36] Fan Yang, et al. Multiple-signal duplicate detection for search evaluation, 2007, SIGIR.

[37] Ophir Frieder, et al. Collection statistics for fast duplicate document detection, 2002, TOIS.

[38] Jack G. Conrad, et al. Online duplicate document detection: signature reliability in a dynamic retrieval environment, 2003, CIKM '03.

[39] Roberto J. Bayardo, et al. Scaling up all pairs similarity search, 2007, WWW '07.

[40] Grace Hui Yang, et al. Near-duplicate detection for eRulemaking, 2005, DG.O.