Cross-language text alignment: A proposed two-level matching scheme for plagiarism detection

Abstract: The exponential growth of documents in various languages across the web, together with the availability of editing and translation tools, has made cross-language plagiarism detection a challenging problem. Given its importance, this study focuses on cross-language text alignment, also known as detailed analysis, which operates on the output of the source retrieval step of cross-language plagiarism detection systems. The paper proposes a two-level matching approach that considers both syntactic and semantic information in order to accurately align plagiarized fragments in the source and suspicious documents. At the first level, a vector space model that employs a dictionary based on multilingual word embeddings, together with a local weighting technique, extracts a minimal set of high-potential candidate fragment pairs rather than considering all possible pairs of fragments. This level also includes a dynamic expansion technique that covers more candidate pairs in order to improve the system’s recall. At the second level, a more precise algorithm examines the candidate pairs at the sentence level using a graph-of-words representation of the text. By modelling both the words and their relationships, this level yields an acceptable increase in the system’s precision, which is its goal. To identify evidence of plagiarism, i.e. potential cases of unauthorized text reuse, the algorithm searches for maximum cliques in the match graph of the source and suspicious texts. With this two-level investigation, the approach is able to discriminate true plagiarism cases from original text. Experimental results on datasets including PAN-PC-11, PAN-PC-12, and SemEval-2017 show that the proposed cross-language text alignment approach significantly outperforms state-of-the-art models and can be fed into an expert system to further improve cross-language plagiarism detection.
The source code is publicly available on GitHub for the purposes of reproducible research.
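
The clique search mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the match graph, so the construction below is an assumption made purely for illustration. Here a node is a (source position, suspicious position) pair whose words are linked by a hypothetical bilingual `dictionary`, and two nodes are connected when they use different positions on both sides and lie within a hypothetical `window` of each other. Maximal cliques are enumerated with the classic Bron-Kerbosch algorithm with pivoting, and the largest one is taken as the alignment evidence.

```python
from itertools import combinations


def build_match_graph(src_tokens, susp_tokens, dictionary, window=3):
    """Build an adjacency map for an illustrative match graph.

    Assumption: a node (i, j) exists when susp_tokens[j] is listed as a
    translation of src_tokens[i]; two nodes are adjacent when they differ in
    both positions and both position gaps are at most `window`.
    """
    nodes = [
        (i, j)
        for i, s in enumerate(src_tokens)
        for j, t in enumerate(susp_tokens)
        if t in dictionary.get(s, ())
    ]
    adj = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        compatible = a[0] != b[0] and a[1] != b[1]
        close = abs(a[0] - b[0]) <= window and abs(a[1] - b[1]) <= window
        if compatible and close:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""

    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex with the most neighbours still in p.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def maximum_clique(adj):
    """Return the largest maximal clique (empty frozenset for an empty graph)."""
    return max(maximal_cliques(adj), key=len, default=frozenset())


if __name__ == "__main__":
    # Toy English-Spanish dictionary; entries are made up for the example.
    toy_dict = {"the": {"el"}, "cat": {"gato"}, "sleeps": {"duerme"}}
    src = ["the", "cat", "sleeps"]
    susp = ["el", "gato", "duerme", "x", "x", "x", "x", "el"]
    graph = build_match_graph(src, susp, toy_dict)
    print(sorted(maximum_clique(graph)))  # [(0, 0), (1, 1), (2, 2)]
```

In this toy case the distant second occurrence of "el" forms its own single-node clique, so the largest clique keeps only the three mutually consistent word matches. A real system would derive the dictionary from multilingual word embeddings and would need its own edge criterion.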
