An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection

The ever-increasing volume of information produced by the widespread use of computers and the web has made effective plagiarism detection methods a necessity. Plagiarism occurs in many settings and forms: in literature, in academic papers, even in program code. Intrinsic plagiarism detection is the task of discovering plagiarized passages in a text document by identifying stylistic changes and inconsistencies within the document itself, when no reference corpus is available. The main idea is to profile the style of the original author and mark the passages that differ significantly from it. In this work, we follow a supervised machine learning classification approach. We treat, for the first time, class imbalance in the data as a crucial aspect of the problem and experiment with various balancing techniques. In addition, we propose some novel stylistic features. We combine our features and imbalanced-dataset treatment with various classification methods. Our detection system is tested on the corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared to the best-performing detection systems on these datasets and achieves the best scores.
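To make the balancing step concrete, the following is a minimal sketch of one common technique, random oversampling of the minority class. The abstract does not say which balancing techniques were used, so this is only an illustrative stand-in; the function name `random_oversample` and the two toy features (average word length and type-token ratio per passage) are assumptions, not the authors' feature set.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)

    # Group the feature vectors by class label.
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())

    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw (with replacement) the samples needed to reach the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data (hypothetical): one row per passage, label 1 = plagiarized.
# Plagiarized passages are the minority, as is typical for this task.
X = [[4.1, 0.62], [4.3, 0.60], [4.0, 0.65], [4.2, 0.61],
     [4.4, 0.63], [4.1, 0.64], [5.6, 0.48], [5.9, 0.45]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

In practice a step like this would be applied to the training split only, so that duplicated passages never leak into the evaluation data.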
