Detecting near-duplicate documents using sentence-level features and supervised learning