CrossLang: the system of cross-lingual plagiarism detection

Plagiarism and text reuse have become easier as the Internet has developed. It is therefore important to check scientific papers for cheating, especially in academia. Existing plagiarism detection systems perform well and have huge source databases, so it is no longer enough to copy text “as is” from a source document to pass it off as “original” work. As a result, another type of plagiarism has become popular: cross-lingual plagiarism. We present CrossLang, a system for detecting this kind of plagiarism for the English-Russian language pair. The key idea of CrossLang is a monolingual approach. Given a suspicious Russian document and an English reference collection, we reduce the task to one language by translating the suspicious document into English with a machine translation system. We then analyze the translated document in two main stages, source retrieval and document comparison, both adapted to our task. At the source retrieval stage, we need to find the documents in the collection that are most relevant to the translated suspicious document. The retrieval algorithm is therefore based on aggregating semantically close words into word classes, which lets it handle reformulated passages. The subsequent document comparison is based on phrase embeddings trained in unsupervised and semi-supervised regimes. We evaluate CrossLang on existing and generated datasets and demonstrate the performance of the whole approach. We have integrated CrossLang into the Antiplagiat system (the most popular and well-known plagiarism detection system in Russia and the CIS) and provide its technical characteristics. We also provide an analysis of the system's performance.
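The source retrieval idea described in the abstract can be illustrated with a minimal sketch. It is not the CrossLang implementation: the `WORD_CLASSES` table is hand-written for illustration (the abstract does not say how the classes are built), the function names are hypothetical, and ranking by the number of shared class bigrams is an assumed scoring rule. The suspicious document is assumed to have already been translated into English.

```python
# Hypothetical word classes: semantically close words share one class id.
# A real system would derive these automatically; this table is an assumption.
WORD_CLASSES = {
    "car": "C_VEHICLE", "automobile": "C_VEHICLE", "vehicle": "C_VEHICLE",
    "fast": "C_QUICK", "quick": "C_QUICK", "rapid": "C_QUICK",
    "drives": "C_MOVE", "moves": "C_MOVE", "travels": "C_MOVE",
}


def to_classes(text):
    """Lowercase, split on whitespace, strip punctuation, map words to class ids."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    return [WORD_CLASSES.get(t, t) for t in tokens if t]


def shingles(classes, n=2):
    """Return the set of n-grams over a sequence of class ids."""
    return {tuple(classes[i:i + n]) for i in range(len(classes) - n + 1)}


def retrieve_sources(translated_doc, collection, top_k=2, n=2):
    """Rank reference documents by the number of class n-grams shared with the query."""
    query = shingles(to_classes(translated_doc), n)
    scores = {
        doc_id: len(query & shingles(to_classes(text), n))
        for doc_id, text in collection.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep only candidates that share at least one class n-gram.
    return [(doc_id, score) for doc_id, score in ranked[:top_k] if score > 0]


collection = {
    "a": "The automobile moves quick.",
    "b": "Cats sleep all day.",
}
print(retrieve_sources("The car travels fast.", collection))
```

Because "car", "travels", and "fast" fall into the same classes as "automobile", "moves", and "quick", document `a` is retrieved even though the two sentences share almost no surface words, which is the kind of reformulation the word-class aggregation is meant to handle.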
KDD 2019, August 2019, Anchorage, Alaska, USA. © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.

CCS CONCEPTS • Information systems → Information retrieval; • Computing methodologies → Natural language processing; Learning settings; • Applied computing → Education.
