Siamese-Based Architecture for Cross-Lingual Plagiarism Detection in English-Hindi Language Pairs

Cross-lingual plagiarism detection (CLPD) is a challenging problem in natural language processing. Cross-lingual plagiarism occurs when text is translated from another language and reused without proper acknowledgment. Most existing methods perform well on monolingual plagiarism detection, but their performance on CLPD is limited, because it is difficult to represent text from two different languages in a common semantic space. In this article, a novel Siamese architecture-based model is proposed to detect cross-lingual plagiarism in English-Hindi language pairs. The proposed model combines a convolutional neural network (CNN) and a bidirectional long short-term memory (Bi-LSTM) network to learn the semantic similarity between cross-lingual sentences. The CNN learns the local context of words, whereas the Bi-LSTM learns the global context of sentences in the forward and backward directions. The proposed model is evaluated on a benchmark data set, the Microsoft paraphrase corpus, converted into English-Hindi language pairs. It outperforms the other models, achieving weighted average precision, recall, and F1-measure scores of 67%, 72%, and 67%, respectively. The experimental results show the effectiveness of the proposed model over the baseline models, which is attributed to its ability to represent cross-lingual text efficiently.
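The reported scores are weighted averages, so a small worked example may help in reading them. The sketch below assumes the common support-weighted convention, in which per-class precision, recall, and F1 are averaged with weights proportional to each class's share of the true labels; the abstract does not state which convention was used. The function name `weighted_scores` and the toy labels are invented for illustration and are not taken from the article.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over the labels in y_true."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        # True positives: examples of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / n
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n / total  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy labels (hypothetical): 1 = plagiarized sentence pair, 0 = not plagiarized.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

p, r, f = weighted_scores(y_true, y_pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under this convention the weighted F1 is an average of per-class F1 scores, not the harmonic mean of the weighted precision and weighted recall, so it need not lie between the two. That would be consistent with a reported F1 of 67% alongside a precision of 67% and a recall of 72%.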
