An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis

Plagiarism is a growing problem in academia. Academics often use plagiarism detection tools to detect similar source-code files. Once similar files are detected, the academic proceeds with the investigation process, which involves identifying the similar source-code fragments within them that could be used as evidence of plagiarism. This paper describes PlaGate, a novel tool that can be integrated with existing plagiarism detection tools to improve plagiarism detection performance. The tool also implements a new approach for investigating the similarity between source-code files with a view to gathering evidence of plagiarism. It presents graphical evidence that allows source-code fragments to be investigated with regard to their contribution toward that evidence. The graphical evidence indicates the relative importance of the given source-code fragments across files in a corpus. This is done by using the Latent Semantic Analysis information retrieval technique to determine how important the fragments are within the specific files under investigation in relation to other files in the corpus.
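
To make the Latent Semantic Analysis step more concrete, the following is a minimal sketch of the general technique, not PlaGate's actual implementation. It assumes a small hand-made term-by-file count matrix, uses NumPy's singular value decomposition, and compares files by cosine similarity in the reduced space. The function name lsa_similarity, the toy term list, and the choice of k = 2 dimensions are all hypothetical and chosen only for illustration.

```python
import numpy as np


def lsa_similarity(term_doc, k):
    """Return pairwise cosine similarities between files in a rank-k LSA space.

    term_doc: 2-D array, rows are terms, columns are files (raw counts here;
    a real system would normally apply a weighting scheme first).
    k: number of singular dimensions to keep (an assumed, hand-picked value).
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular values; each row of `docs` is one file
    # expressed in the reduced k-dimensional space.
    docs = (np.diag(s[:k]) @ vt[:k, :]).T
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty files
    unit = docs / norms
    return unit @ unit.T


# Hypothetical toy corpus: three files, five identifier-like terms.
terms = ["readLine", "parseInt", "sum", "drawCircle", "radius"]
term_doc = np.array(
    [
        [2, 2, 0],  # readLine
        [1, 1, 0],  # parseInt
        [3, 2, 0],  # sum
        [0, 0, 2],  # drawCircle
        [0, 0, 3],  # radius
    ],
    dtype=float,
)

sim = lsa_similarity(term_doc, k=2)
print(np.round(sim, 3))
```

In this toy corpus the first two files share all of their terms and the third shares none, so the first pair should score close to 1 and pairs involving the third file close to 0. Ranking individual source-code fragments by importance, as the paper describes, would need further steps that this sketch does not attempt.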
