An integrated approach for intrinsic plagiarism detection

Abstract Effective plagiarism detection methods are seen as essential for the next-generation web. In this paper, we present a novel approach to plagiarism detection without reference collections. The proposed approach relies on statistical properties of the most common words and on Latent Semantic Analysis, which is applied to extract the usage patterns of those words. The method aims to generate a model of an author’s “style” by revealing a set of features of authorship. The model generation procedure focuses on just one author, in an attempt to summarise the aspects of that author’s style in a definitive and clear-cut manner. The feature set of the intrinsic model is based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books by a particular author. The approach was evaluated using leave-one-out cross-validation on the CEN (Corpus of English Novels) data set. The results indicate that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results also show that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, predicting the author classes with an overall accuracy of 97%.
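The three feature groups named in the abstract (frequency of the most common words, their relative frequency in the book series, and the deviation of those frequencies across an author's books) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `tokenize` and `common_word_features`, the regular-expression tokenizer, the `top_n` cut-off, and the use of the population standard deviation as the "deviation" are all assumptions made for illustration. The Latent Semantic Analysis and Multi-Layer Perceptron stages are not shown.

```python
import re
from collections import Counter
from statistics import pstdev


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def common_word_features(books, top_n=5):
    """Per-word style features for one author (hypothetical helper).

    books: list of strings, one per book in the author's series.
    Returns a dict mapping each of the top_n most common words to:
      series_rel_freq   - share of all tokens in the series
      per_book_rel_freq - share of tokens within each book
      deviation         - population std. dev. of the per-book shares
    """
    tokens_per_book = [tokenize(book) for book in books]
    series_counts = Counter(t for toks in tokens_per_book for t in toks)
    total = sum(series_counts.values()) or 1  # avoid division by zero

    features = {}
    for word, count in series_counts.most_common(top_n):
        # max(len(toks), 1) guards against empty books.
        per_book = [toks.count(word) / max(len(toks), 1)
                    for toks in tokens_per_book]
        features[word] = {
            "series_rel_freq": count / total,
            "per_book_rel_freq": per_book,
            "deviation": pstdev(per_book),
        }
    return features


if __name__ == "__main__":
    toy_series = [
        "the cat and the dog",
        "the bird and the fish and the frog",
    ]
    for word, stats in common_word_features(toy_series, top_n=2).items():
        print(word, stats)
```

In a pipeline of the kind the abstract describes, vectors such as these would be one input to the classifier; a low deviation for a word suggests the author uses it at a stable rate, so a passage that departs from that rate is a candidate for a change of authorship.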
