A Plagiarism Detection System for Arabic Text-Based Documents

This paper presents Iqtebas 1.0, a novel plagiarism detection system for Arabic text-based documents. It is an early work dedicated to detecting plagiarism in Arabic documents. Arabic is a morphologically rich language and is among the most widely used languages in the world and on the Internet. Given a document and a set of suspected files, our goal is to compute the originality value of the examined document. The originality value of a text is computed from the distance between each sentence in the text and the closest sentence in the suspected files, if one exists. The proposed system is built on a search engine in order to reduce the cost of pairwise similarity comparison. For the indexing process, we use the winnowing n-gram fingerprinting algorithm to reduce the index size. The fingerprints of each sentence are derived from its n-grams, which are represented by hash codes; the winnowing algorithm computes these fingerprints for each sentence. As a result, search time is improved while the detection process remains accurate and robust. In the experiments, Iqtebas 1.0 achieved a recall of 94% and a precision of 99%. Moreover, a comparison between Iqtebas and the well-known plagiarism detection system SafeAssign confirmed the high performance of Iqtebas.
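The pipeline described in the abstract (sentence-level n-gram hashing, winnowing, an index over fingerprints, and a sentence-based originality value) can be illustrated with a minimal Python sketch. This is not the Iqtebas implementation: the function names, the character n-gram size `n=5`, the window size `window=4`, the use of MD5 for hashing, Jaccard similarity as the sentence distance, and the `threshold=0.5` cutoff are all assumptions chosen for illustration, and no Arabic-specific preprocessing (normalization, stemming) is applied.

```python
import hashlib
from collections import defaultdict


def ngram_hashes(sentence, n=5):
    """Hash every character n-gram of the sentence (whitespace removed)."""
    chars = "".join(sentence.split())
    return [
        int(hashlib.md5(chars[i:i + n].encode("utf-8")).hexdigest()[:16], 16)
        for i in range(len(chars) - n + 1)
    ]


def winnow(hashes, window=4):
    """Keep the minimum hash of each sliding window.

    Positions are dropped because only hash values are compared here.
    """
    if not hashes:
        return set()
    span = min(window, len(hashes))  # short input: one window over everything
    return {min(hashes[s:s + span]) for s in range(len(hashes) - span + 1)}


def fingerprint(sentence, n=5, window=4):
    return winnow(ngram_hashes(sentence, n), window)


def build_index(suspect_sentences):
    """Inverted index: fingerprint hash -> ids of suspect sentences holding it."""
    index = defaultdict(set)
    prints = []
    for sid, sentence in enumerate(suspect_sentences):
        fp = fingerprint(sentence)
        prints.append(fp)
        for h in fp:
            index[h].add(sid)
    return index, prints


def closest_similarity(sentence, index, prints):
    """Highest Jaccard similarity between the sentence and any indexed one."""
    fp = fingerprint(sentence)
    candidates = set()
    for h in fp:  # only sentences sharing a fingerprint are compared
        candidates |= index.get(h, set())
    best = 0.0
    for sid in candidates:
        other = prints[sid]
        best = max(best, len(fp & other) / len(fp | other))
    return best


def originality(sentences, suspect_sentences, threshold=0.5):
    """Fraction of sentences whose closest suspect match is below threshold."""
    if not sentences:
        return 1.0
    index, prints = build_index(suspect_sentences)
    original = sum(
        1 for s in sentences if closest_similarity(s, index, prints) < threshold
    )
    return original / len(sentences)
```

The inverted index stands in for the search engine: a query sentence is compared only against suspect sentences that share at least one fingerprint, rather than against every sentence pairwise. Because the hashing operates on UTF-8 bytes, the same code accepts Arabic strings unchanged, although a real system would likely normalize the text first.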

Ashraf Elnagar | Ameera Jadalla