Cross-Language Plagiarism Detection System Using Latent Semantic Analysis and Learning Vector Quantization

Automated cross-language plagiarism detection has recently become essential. Because scientific publications in Bahasa Indonesia are scarce, many Indonesian authors frequently consult English-language publications in order to increase the number of scientific publications in Bahasa Indonesia, which is currently rising. Because the syntax of Bahasa Indonesia differs from that of English, most existing methods for automated cross-language plagiarism detection do not give satisfactory results. This paper examines the feasibility of using Latent Semantic Analysis (LSA) in an automated cross-language plagiarism detector for two languages with different syntax. To improve performance, several modifications to LSA are proposed. Using a learning vector quantization (LVQ) classifier with LSA and taking the Frobenius norm into account, the system reached an accuracy of up to 65.98%. The experiments also showed that the best accuracy achieved is 87% with a document size of 6 words, and that the document definition size must be kept below 10 words in order to maintain high accuracy. Additionally, based on the experimental results, this paper recommends the frequency-of-occurrence method over the binary method for constructing the term-document matrix.
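To make the two term-document matrix constructions concrete, the following is a minimal sketch and not the authors' implementation. It builds a frequency-of-occurrence matrix and a binary matrix from toy documents, then computes the Frobenius norm of a rank-k SVD approximation. The function names (`term_document_matrix`, `truncated_frobenius_norm`), the whitespace tokenizer, and the use of the truncated-SVD norm as a feature are all assumptions made for illustration; the abstract does not specify how the Frobenius norm enters the system.

```python
import numpy as np

def term_document_matrix(docs, binary=False):
    """Build a terms-by-documents matrix (hypothetical helper).

    binary=False counts occurrences (frequency method);
    binary=True records only presence or absence (binary method).
    """
    # Assumption: naive lowercase whitespace tokenization.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            if binary:
                A[index[w], j] = 1.0
            else:
                A[index[w], j] += 1.0
    return vocab, A

def truncated_frobenius_norm(A, k):
    """Frobenius norm of the rank-k SVD approximation of A.

    This equals the square root of the sum of the k largest
    squared singular values.
    """
    s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
    return float(np.sqrt(np.sum(s[:k] ** 2)))

docs = ["the cat sat on the mat", "the dog sat"]
vocab, A_freq = term_document_matrix(docs)
_, A_bin = term_document_matrix(docs, binary=True)

print(vocab)
print(A_freq[vocab.index("the"), 0], A_bin[vocab.index("the"), 0])
print(truncated_frobenius_norm(A_freq, k=1))
```

In this toy example the word "the" occurs twice in the first document, so the frequency matrix keeps that count while the binary matrix collapses it to 1; that is the information the frequency method retains and the binary method discards.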
