A Framework for Plagiarism Detection in Arabic Documents

We are developing a web-based system to detect plagiarism in written Arabic documents. This paper describes the proposed framework, which comprises two main components, one global and one local. The global component is heuristics-based: given a potentially plagiarized document, it constructs a set of representative queries using several of the best-performing heuristics. These queries are then submitted to Google via Google's search API to retrieve candidate source documents from the Web. The local component carries out detailed similarity computations, combining different similarity computation techniques to determine which parts of the given document are plagiarized and from which of the retrieved source documents. Since this is an ongoing research project, the quality of the overall system is not
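As a rough illustration of the two-stage design described above, the following Python sketch pairs a query-formulation step with a sentence-level similarity step. It is a minimal sketch under stated assumptions, not the framework's actual implementation: the frequent-word query heuristic, the word-trigram Jaccard measure, the threshold value, and all function names (`build_queries`, `detect`, and so on) are hypothetical choices made for this example. The retrieval step through Google's search API is deliberately left out, so candidate sources are passed in as a plain dictionary.

```python
import re
from collections import Counter


def tokenize(text):
    # \w+ matches Unicode letters and digits, so Arabic letters are kept.
    # Diacritics and normalization are not handled in this sketch.
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    # Split on Latin and Arabic sentence-ending punctuation.
    return [s.strip() for s in re.split(r"[.!?؟]+", text) if s.strip()]


def build_queries(document, words_per_query=5, sentences_per_chunk=2):
    """Global step (hypothetical heuristic): one query per chunk of
    sentences, built from the chunk's most frequent longer words."""
    sentences = split_sentences(document)
    queries = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i + sentences_per_chunk])
        words = [w for w in tokenize(chunk) if len(w) > 3]
        top = [w for w, _ in Counter(words).most_common(words_per_query)]
        if top:
            queries.append(" ".join(top))
    return queries


def ngrams(tokens, n=3):
    # Set of word n-grams for a token list.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard overlap of word n-grams (one assumed similarity measure)."""
    ga, gb = ngrams(tokenize(a), n), ngrams(tokenize(b), n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def detect(suspicious, candidates, threshold=0.5):
    """Local step: for each suspicious sentence, report the best-matching
    candidate source if its score reaches the (assumed) threshold."""
    matches = []
    for sentence in split_sentences(suspicious):
        best_source, best_score = None, 0.0
        for source_id, source_text in candidates.items():
            for source_sentence in split_sentences(source_text):
                score = jaccard(sentence, source_sentence)
                if score > best_score:
                    best_source, best_score = source_id, score
        if best_source is not None and best_score >= threshold:
            matches.append((sentence, best_source, best_score))
    return matches
```

In a full pipeline, the strings returned by `build_queries` would be sent to a web search service, and the retrieved pages would form the `candidates` mapping passed to `detect`. A single n-gram measure is used here only for brevity; the framework itself combines several similarity techniques.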
