Fuzzy Semantic-Based Similarity and Big Data for Detecting Multilingual Plagiarism in Arabic Documents

Intelligent monolingual plagiarism is already a complicated, fuzzy process; adding translation turns it into a cross-language problem and obfuscates it further, which poses difficulties for current plagiarism detection methods. Multilingual plagiarism can be more complicated than simple copy, translate, and paste: it is defined as the unacknowledged reuse of a text involving its translation from one natural language to another without proper reference to the original source. Before detection, several NLP techniques are used to characterize the input texts (tokenization, stop-word removal, POS tagging, and text segmentation). In this paper, fuzzy semantic similarity between words is studied using the WordNet-based similarity measures of Wu & Palmer and Lin. A common problem in any data processing system is efficient large-scale text comparison, especially fuzzy semantic similarity used to reveal dishonest practices in Arabic documents, owing both to the complexity of the Arabic language and to the growing number of publications and of suspicious documents that may be sources of plagiarism. To remedy this, vague concepts and fuzzy techniques are used in a big data environment. The work is carried out in parallel using Apache Hadoop with its distributed file system HDFS and the MapReduce programming model. The proposed approach was evaluated on 400 English and Arabic cases from different sources (news, articles, tweets, and academic works), including 25% machine-translated plagiarism cases and 75% translated cases (machine- and human-based) with a share of obfuscated plagiarism, e.g. handmade paraphrases and back-translation. Experimental verifications and comparisons show that Fuzzy-WuP outperforms Fuzzy-Lin in both results and running time. Results are evaluated using three measures: precision, recall, and F-measure.
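As a rough illustration of the word-level measure named above, the following minimal sketch computes Wu & Palmer similarity, 2 * depth(lcs) / (depth(a) + depth(b)), on a tiny hand-made taxonomy. The taxonomy, the function names, and the sentence-level aggregation (average of each word's best match) are assumptions made for illustration only; they stand in for WordNet and are not taken from the paper's implementation.

```python
# Hypothetical toy is-a taxonomy (child -> parent), standing in for WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def path_to_root(word):
    """Return the chain [word, parent, ..., root]."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path


def depth(word):
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(path_to_root(word))


def wup(a, b):
    """Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common subsumer.
    lcs = next(node for node in path_to_root(b) if node in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def fuzzy_sentence_similarity(suspicious, source):
    """Assumed aggregation: mean over suspicious words of the best match in source."""
    best = [max(wup(w, s) for s in source) for w in suspicious]
    return sum(best) / len(best)
```

For example, `wup("dog", "cat")` is higher than `wup("dog", "car")` because "dog" and "cat" share the deeper ancestor "animal", while "dog" and "car" meet only at the root.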
