Plagiarism detection in Arabic scripts using fuzzy information retrieval

The structure of the Arabic language calls for fuzzy, or vague, matching to reveal dishonest practices in Arabic documents. In this paper, we present a statement-based plagiarism detection approach for Arabic scripts using a fuzzy-set IR model. The degree of similarity between two statements is calculated and compared to a threshold value to judge whether they are the same or different. We built a corpus in which all stopwords were removed and non-stop words were stemmed, as is typical for Arabic IR. The corpus contains 100 documents with 4367 statements in total. Five query documents with about 250 plagiarized statements were constructed and tested. Experimental results show that fuzzy-set IR successfully detected not only exact copies but also similar statements that have a different structure. However, our Arabic fuzzy-set model does not handle rewording with different synonyms/antonyms, a deficiency that we leave to future work on modeling the system using an Arabic thesaurus.

Keywords: fuzzy-set information retrieval; Arabic; plagiarism detection
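
To make the thresholded comparison concrete, the following is a minimal sketch of one common fuzzy-set IR formulation for statement similarity, not necessarily the exact model used in this paper. It assumes statements have already been stripped of stopwords and stemmed. The transliterated stems, the `CORRELATION` table and its value, the function names, and the threshold of 0.65 are all hypothetical and chosen only for illustration.

```python
from math import prod

# Hypothetical correlation factors between distinct stems (assumed values).
# A real system would have to derive these from a corpus.
CORRELATION = {
    frozenset(("ktb", "ktab")): 0.6,
}


def correlation(a, b):
    """Correlation factor in [0, 1]; identical stems correlate fully."""
    if a == b:
        return 1.0
    return CORRELATION.get(frozenset((a, b)), 0.0)


def membership(word, statement):
    """Fuzzy degree to which `word` belongs to `statement`:
    1 - product over the statement's stems of (1 - correlation)."""
    return 1.0 - prod(1.0 - correlation(word, w) for w in statement)


def similarity(s1, s2):
    """Average membership of the stems of s1 in s2 (asymmetric)."""
    if not s1:
        return 0.0
    return sum(membership(w, s2) for w in s1) / len(s1)


def is_plagiarized(s1, s2, threshold=0.65):
    """Flag the pair only if similarity holds in both directions.
    The threshold value is an arbitrary assumption."""
    return min(similarity(s1, s2), similarity(s2, s1)) >= threshold


original = ["tlb", "ktb", "drs"]
reordered = ["drs", "tlb", "ktb"]   # same stems, different structure
related = ["tlb", "ktab", "drs"]    # one stem replaced by a correlated one
unrelated = ["jdyd", "qlm"]

print(is_plagiarized(original, reordered))
print(is_plagiarized(original, related))
print(is_plagiarized(original, unrelated))
```

Because membership ignores word order, a reordered statement scores the same as an exact copy. A reworded statement is caught only when the correlation table links the replaced stems, which is consistent with the limitation on synonyms noted in the abstract.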
