Detecting near-duplicate documents using sentence-level features and supervised learning

[1]  Raghav Kaushik,et al.  Efficient exact set-similarity joins , 2006, VLDB.

[2]  Bin Wang,et al.  VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams , 2007, VLDB.

[3]  Andreas Paepcke,et al.  SpotSigs: robust and efficient near duplicate detection in large web collections , 2008, SIGIR '08.

[4]  Ronald Fagin,et al.  Efficient similarity search and classification via rank aggregation , 2003, SIGMOD '03.

[5]  Xuemin Lin,et al.  SPARK2: Top-k Keyword Query in Relational Databases , 2007, IEEE Transactions on Knowledge and Data Engineering.

[6]  Gurmeet Singh Manku,et al.  Detecting near-duplicates for web crawling , 2007, WWW '07.

[7]  Marc Najork,et al.  On the evolution of clusters of near-duplicate Web pages , 2003, Proceedings of the First Latin American Web Congress (LA-WEB 2003).

[8]  Ricardo A. Baeza-Yates,et al.  Where and How Duplicates Occur in the Web , 2006, 2006 Fourth Latin American Web Congress.

[9]  Lei Wang,et al.  On Similarity Preserving Feature Selection , 2013, IEEE Transactions on Knowledge and Data Engineering.

[10]  Monika Henzinger,et al.  Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.

[11]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol.

[12]  Marcos André Gonçalves,et al.  A Genetic Programming Approach to Record Deduplication , 2012, IEEE Transactions on Knowledge and Data Engineering.

[13]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[14]  Tetsuya Sakai,et al.  Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology , 2009 .

[15]  T. Martin McGinnity,et al.  A Context-Based Word Indexing Model for Document Summarization , 2013, IEEE Transactions on Knowledge and Data Engineering.

[16]  Timothy W. Finin,et al.  Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy , 2013, IEEE Transactions on Knowledge and Data Engineering.

[17]  Wen-tau Yih,et al.  Adaptive near-duplicate detection via similarity learning , 2010, SIGIR.

[18]  魏小敏 (Xiaomin Wei),et al.  Research on the Bag of Words algorithm framework , 2011 .

[19]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.

[20]  Jongik Kim,et al.  Efficient Exact Similarity Searches Using Multiple Token Orderings , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[21]  Bruno Martins.  A Supervised Machine Learning Approach for Duplicate Detection over Gazetteer Records , 2011, GeoS.

[22]  Xuemin Lin,et al.  Top-k Set Similarity Joins , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[23]  Jenq-Haur Wang,et al.  Exploiting Sentence-Level Features for Near-Duplicate Document Detection , 2009, AIRS.

[24]  Xueqi Cheng,et al.  Detecting Near-Duplicates in Large-Scale Short Text Databases , 2008, PAKDD.

[25]  Jugal K. Kalita,et al.  Cutting Plane Training for Linear Support Vector Machines , 2013, IEEE Transactions on Knowledge and Data Engineering.

[26]  Grace Hui Yang,et al.  Near-duplicate detection by instance-level constrained clustering , 2006, SIGIR.

[27]  Junping Qiu,et al.  Detection and optimized disposal of near-duplicate pages , 2010, 2010 2nd International Conference on Future Computer and Communication.

[28]  Hector Garcia-Molina,et al.  Copy detection mechanisms for digital documents , 1995, SIGMOD '95.

[29]  Maosong Sun,et al.  Semi-Supervised SimHash for Efficient Document Similarity Search , 2011, ACL.

[30]  Barbara Plank,et al.  Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies , 2011 .

[31]  A. Govardhan,et al.  A Novel and Efficient Approach For Near Duplicate Page Detection in Web Crawling , 2009, 2009 IEEE International Advance Computing Conference.

[32]  Paolo Rosso,et al.  Detection of near-duplicate user generated contents: the SMS spam collection , 2011, SMUC '11.

[33]  Lei Wang,et al.  Achieving both high precision and high recall in near-duplicate detection , 2008, CIKM '08.

[34]  Andrei Z. Broder,et al.  Identifying and Filtering Near-Duplicate Documents , 2000, CPM.

[35]  Dmitri Loguinov,et al.  Probabilistic near-duplicate detection using simhash , 2011, CIKM '11.

[36]  Fan Yang,et al.  Multiple-signal duplicate detection for search evaluation , 2007, SIGIR.

[37]  Ophir Frieder,et al.  Collection statistics for fast duplicate document detection , 2002, TOIS.

[38]  Jack G. Conrad,et al.  Online duplicate document detection: signature reliability in a dynamic retrieval environment , 2003, CIKM '03.

[39]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[40]  Grace Hui Yang,et al.  Near-duplicate detection for eRulemaking , 2005, DG.O.