Supervised Ranking for Plagiarism Source Retrieval: Notebook for PAN at CLEF 2014

Source retrieval uses a search engine to retrieve candidate sources of plagiarism for a given suspicious document so that more accurate comparisons can be made. We describe a source retrieval strategy that uses a supervised method to classify and rank search engine results as potential sources of plagiarism without retrieving the documents themselves. In evaluation, our approach achieved the highest precision (0.57) and F1 score (0.47) in the 2014 PAN Source Retrieval task.
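
The idea of classifying and ranking search results without downloading them can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not state which features or classifier were used, so the word-overlap features, the small logistic regression, the `snippet` and `url` field names, and the 0.5 threshold are all hypothetical.

```python
import math

def features(snippet, suspicious_text):
    """Hypothetical features computed from the result snippet only
    (the result document itself is never downloaded)."""
    s = set(snippet.lower().split())
    d = set(suspicious_text.lower().split())
    overlap = len(s & d)
    return [
        1.0,                             # bias term
        overlap / len(s) if s else 0.0,  # share of snippet words found in the document
        overlap / len(d) if d else 0.0,  # share of document words found in the snippet
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Estimated probability that a result is a plagiarism source."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def train(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights with plain stochastic gradient ascent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = predict(w, xi)
            w = [wj + lr * (yi - p) * xj for wj, xj in zip(w, xi)]
    return w

def rank(results, suspicious_text, w, threshold=0.5):
    """Keep results classified as likely sources, ordered by descending score."""
    scored = []
    for r in results:
        p = predict(w, features(r["snippet"], suspicious_text))
        if p >= threshold:
            scored.append((r["url"], p))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch, `train` would be fitted on snippets labelled as source or non-source, and `rank` would then be applied to the results returned for each query, so that only the top-ranked URLs need to be retrieved for detailed comparison.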
