English-Persian Plagiarism Detection based on a Semantic Approach

Plagiarism, defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them”, poses a major challenge to the publication and spread of knowledge. Plagiarism is commonly divided into four categories: direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism, which is sometimes referred to as cross-lingual plagiarism. In cross-lingual plagiarism, writers meld a translation with their own words and ideas. Building on monolingual plagiarism detection methods, this paper seeks a way to detect cross-lingual plagiarism. A framework called Multi-Lingual Plagiarism Detection (MLPD) is presented for cross-lingual plagiarism analysis, with the ultimate objective of detecting plagiarism cases. English is the reference language, and Persian materials are back-translated using translation tools. The data for assessing MLPD were obtained from the English-Persian Mizan parallel corpus. Apache Solr was used to crawl and index the documents. The mean accuracy of the proposed method was 98.82% when highly accurate translation tools were employed, which indicates the high accuracy of the method itself. With the Google translation service, the mean accuracy was 56.9%. These tests demonstrate that improved translation tools enhance the accuracy of the proposed method.
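
The general idea of translating the Persian text into the reference language and then applying a monolingual comparison can be illustrated with a minimal sketch. This is not the MLPD implementation: the toy lexicon `TOY_LEXICON`, the word-by-word `translate_to_english` stand-in for a real translation tool, the bag-of-words cosine measure (used here in place of the paper's semantic similarity), and the `threshold` value are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a real translation tool.
TOY_LEXICON = {"این": "this", "کتاب": "book", "خوب": "good", "است": "is"}


def translate_to_english(persian_text):
    """Word-by-word stand-in for a translation service (assumption)."""
    return " ".join(TOY_LEXICON.get(tok, tok) for tok in persian_text.split())


def bag_of_words(text):
    """Lower-cased token counts for a piece of text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def is_suspicious(english_source, persian_suspect, threshold=0.8):
    """Translate the Persian text to English, then compare monolingually.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    translated = translate_to_english(persian_suspect)
    score = cosine_similarity(bag_of_words(english_source), bag_of_words(translated))
    return score >= threshold
```

Because the comparison runs entirely on the translated text, its quality is bounded by the quality of the translation step, which is consistent with the gap the abstract reports between the two translation settings.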
