Unsupervised Ranking for Plagiarism Source Retrieval: Notebook for PAN at CLEF 2013

In plagiarism detection, the source retrieval task uses a search engine to retrieve candidate sources of plagiarism for a suspicious document. It narrows the set of candidate documents efficiently so that more accurate comparisons can then take place. We describe a source retrieval strategy that uses an unsupervised ranking method to rank the results returned by a search engine by their similarity to the query document, and that retrieves only those documents likely to be sources of plagiarism. In the evaluation, our approach achieved the highest F1 score (0.47) among all task participants.
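The rank-then-filter idea can be illustrated with a minimal sketch. The abstract does not say which similarity measure or cutoff the system uses, so everything here is an assumption for illustration: cosine similarity over raw term counts stands in for the unsupervised ranking method, and the function names (`term_vector`, `cosine`, `rank_and_filter`), the `(url, snippet)` result format, and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Return lowercase bag-of-words term counts for a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_and_filter(suspicious_text, results, threshold=0.3):
    """Rank (url, snippet) search results by similarity to the suspicious
    document and keep only those scoring at or above the threshold.

    The threshold is an assumed, illustrative cutoff, not a value from the paper.
    """
    query_vec = term_vector(suspicious_text)
    scored = [(cosine(query_vec, term_vector(snippet)), url)
              for url, snippet in results]
    # Highest similarity first; only these would be downloaded for comparison.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(url, score) for score, url in scored if score >= threshold]


if __name__ == "__main__":
    suspicious = "the quick brown fox jumps over the lazy dog"
    results = [
        ("a", "stock markets fell sharply today"),
        ("b", "a quick brown fox jumps over a lazy dog"),
    ]
    for url, score in rank_and_filter(suspicious, results):
        print(url, round(score, 3))
```

Filtering after ranking is what limits retrieval to likely sources: results below the cutoff are never fetched, which trades some recall for precision.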
