Identifying cross-lingual plagiarism using rich semantic features and deep neural networks: A study on Arabic-English plagiarism cases

Abstract: The rapid growth of digital content creates a need to protect the academic originality of translated texts. Cross-lingual semantic similarity is concerned with measuring the degree of similarity between textual pairs written in two different languages and determining whether they are plagiarized. Unlike existing approaches, which exploit lexical and syntactic features for mono-lingual similarity, this work proposes rich semantic features extracted from cross-language textual pairs, including topic similarity, semantic role labeling, spatial role labeling, named entity recognition, bag-of-stop-words, bag-of-meanings for all terms, n-most frequent terms, n-least frequent terms, and different combinations of these. Knowledge-based semantic networks such as BabelNet and WordNet were used to compute semantic relatedness across languages. This paper investigates two tasks, namely cross-lingual semantic text similarity (CL-STS) and plagiarism detection and judgement (PD), using deep neural networks, which, to the best of our knowledge, have not previously been applied to STS and PD in a cross-lingual setting with such a combination of features. For this purpose, we propose different neural network architectures that solve the PD task either as binary classification (plagiarized/independently written) or as a finer-grained multi-class classification (literally translated/paraphrased/summarized/independently written). Deep neural networks were also used as regressors to predict semantic connotations for the CL-STS task. Experiments were performed on a large handmade dataset of 71,910 Arabic-English pairs taken from multiple sources. Overall, the results showed that deep neural networks with rich semantic features achieve encouraging results in comparison to the baselines. The proposed classifiers and regressors show comparable performance across the different neural network architectures, but both the binary and multi-class classifiers outperform the regressors. Finally, the evaluation and analysis of different feature sets showed that deeper semantic features yield better classification results.
