Classifying and ranking search engine results as potential sources of plagiarism

Source retrieval for plagiarism detection uses a search engine to retrieve candidate sources of plagiarism for a given suspicious document so that more accurate comparisons can then be made against them. To minimize the number of unnecessary comparisons, only documents that are likely to be sources of plagiarism should be retrieved. This paper describes a supervised strategy for source retrieval in which search results are classified and ranked as potential sources of plagiarism using only the information available at search time, without retrieving the search result documents themselves. Compared to a baseline method, the supervised method improves precision by up to 3.28%, recall by up to 2.6%, and the F1 score by up to 3.37%. A feature analysis suggests that features based on document and search result similarity are the most important for search result classification.
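As a rough illustration of working only with information available at search time, the sketch below scores each search result by the similarity between the suspicious text and the result's snippet, discards low-scoring results, and ranks the rest. It is not the method described in the abstract: the supervised classifier is replaced here by a fixed similarity cut-off, and the names `cosine_similarity`, `rank_candidates`, and `threshold`, along with the `(url, snippet)` result format, are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two texts using raw term-frequency vectors."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def rank_candidates(suspicious_text, results, threshold=0.2):
    """Keep results whose snippet is similar enough, ranked by similarity.

    `results` is a list of (url, snippet) pairs (a hypothetical format).
    `threshold` is an arbitrary illustrative cut-off standing in for the
    decision of a trained classifier; it is not taken from the paper.
    """
    scored = [
        (url, cosine_similarity(suspicious_text, snippet))
        for url, snippet in results
    ]
    kept = [(url, score) for url, score in scored if score >= threshold]
    # Most similar candidates first, so they would be compared first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_candidates(text, results)` returns only the results that pass the cut-off, most similar first, so no result document has to be downloaded before deciding whether it is worth comparing. In a supervised setting, the similarity score would be one feature among several fed to a trained classifier rather than being compared to a fixed threshold.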
