An Integrated Machine Learning Approach for Extrinsic Plagiarism Detection

Plagiarism detection is becoming increasingly important because of the need for integrity in education. In this paper, we present a new integrated approach to extrinsic plagiarism detection. The approach combines four well-known models: Bag of Words (BOW), Latent Semantic Analysis (LSA), stylometry, and Support Vector Machines (SVM). It captures usage patterns of the most common words (MCW) in books by 25 authors, and it incorporates each author's stylistic features by adjusting the LSA weighting technique. The adjusted LSA method was trained in a novel manner using leave-one-out cross-validation and compared with the traditional LSA method. The results show that the enhanced weighting scheme of the adjusted LSA method outperforms the traditional LSA method.
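As a rough illustration of two ingredients named in the abstract, most-common-word (MCW) features and leave-one-out cross-validation, the following Python sketch may help. It is not the authors' implementation: the LSA weighting and the SVM classifier are deliberately omitted and replaced by a plain nearest-neighbour match on cosine similarity, and all function names, the toy corpus, and the value of `k` are assumptions made for this example only.

```python
from collections import Counter
import math


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def most_common_words(docs, k):
    """Return the k most frequent words across the given documents (the MCW list)."""
    counts = Counter()
    for text in docs:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]


def mcw_vector(text, vocab):
    """Relative frequency of each MCW in a single document."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty documents
    return [counts[w] / total for w in vocab]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leave_one_out_accuracy(docs, authors, k=10):
    """Hold out each document in turn and predict its author from the rest.

    Stand-in classifier (an assumption of this sketch): the held-out document
    receives the author of its most similar training document.
    """
    correct = 0
    for i in range(len(docs)):
        train_docs = docs[:i] + docs[i + 1:]
        train_authors = authors[:i] + authors[i + 1:]
        # The MCW list is rebuilt without the held-out document.
        vocab = most_common_words(train_docs, k)
        held_out = mcw_vector(docs[i], vocab)
        sims = [cosine(held_out, mcw_vector(d, vocab)) for d in train_docs]
        best = max(range(len(sims)), key=sims.__getitem__)
        correct += train_authors[best] == authors[i]
    return correct / len(docs)


# Hypothetical toy corpus: two "authors" with clearly different function-word habits.
toy_docs = [
    "the the the a",
    "the the the the a",
    "of of of and",
    "of of of of and",
]
toy_authors = ["A", "A", "B", "B"]
print(leave_one_out_accuracy(toy_docs, toy_authors, k=4))
```

In a fuller pipeline along the lines the abstract describes, the MCW frequency vectors would presumably be weighted and projected with LSA and then passed to an SVM, with the same hold-one-out loop wrapped around the training step.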
