A Mixed Fuzzy Similarity Approach to Detect Plagiarism in Persian Texts

A variety of methods and metrics have been proposed to measure the similarity between documents and to support plagiarism detection systems. However, most of them do not take the ambiguity inherent in natural language into account. This paper therefore proposes a new method that combines lexical and semantic features and similarity measures. In the first step, after preprocessing and stop-word removal, the words of a text are divided into two groups: general words and domain-specific knowledge words. A mixed lexical and semantic fuzzy inference system is then designed to assess text similarity. The proposed method was evaluated on the Persian paper abstracts of the International Conference on e-Learning and e-Teaching (ICELET) corpus, using an IT domain knowledge ontology. The results indicate that the proposed method achieves a precision of 79% and detects 83% of the plagiarism cases.
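The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the toy ontology, the English tokens, the triangular membership functions, and the rule consequents are all assumptions chosen only to show how a lexical score on general words and a semantic score on domain words might be merged by a fuzzy inference step (here a zero-order Sugeno-style weighted average).

```python
# Toy resources: hypothetical stand-ins, not the paper's stop-word list or IT ontology.
STOP_WORDS = {"the", "a", "of", "is", "in", "for"}
ONTOLOGY = {  # domain word -> concept (assumed mapping)
    "lms": "elearning_platform",
    "moodle": "elearning_platform",
    "server": "infrastructure",
    "cloud": "infrastructure",
}


def split_words(text):
    """Remove stop words, then split tokens into general and domain-specific sets."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    general = {t for t in tokens if t not in ONTOLOGY}
    domain = {t for t in tokens if t in ONTOLOGY}
    return general, domain


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def memberships(x):
    """Degrees of membership of x in [0, 1] in three assumed fuzzy sets."""
    return {
        "low": max(0.0, 1.0 - 2.0 * x),
        "medium": max(0.0, 1.0 - 2.0 * abs(x - 0.5)),
        "high": max(0.0, 2.0 * x - 1.0),
    }


# Crisp value attached to each linguistic level (assumed).
LEVEL = {"low": 0.0, "medium": 0.5, "high": 1.0}


def fuzzy_similarity(lexical, semantic):
    """Combine two scores with nine rules: min for AND, weighted average output."""
    mu_lex = memberships(lexical)
    mu_sem = memberships(semantic)
    num = den = 0.0
    for lex_level, lex_degree in mu_lex.items():
        for sem_level, sem_degree in mu_sem.items():
            strength = min(lex_degree, sem_degree)  # rule firing strength
            consequent = (LEVEL[lex_level] + LEVEL[sem_level]) / 2.0  # assumed rule output
            num += strength * consequent
            den += strength
    return num / den if den else 0.0


def similarity(text_a, text_b):
    """Lexical score on general words, concept-level score on domain words, fused fuzzily."""
    gen_a, dom_a = split_words(text_a)
    gen_b, dom_b = split_words(text_b)
    lexical = jaccard(gen_a, gen_b)
    semantic = jaccard({ONTOLOGY[w] for w in dom_a}, {ONTOLOGY[w] for w in dom_b})
    return fuzzy_similarity(lexical, semantic)
```

In this sketch, two sentences that share their general words and use different domain words mapped to the same concepts (for example "lms" and "moodle") receive a high combined score, which is the kind of paraphrase a purely lexical measure would miss. A real system would replace the toy resources with a Persian preprocessing pipeline and a full domain ontology.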
