Cross-lingual text alignment for fine-grained plagiarism detection

Fast and easy access to a wide range of documents in various languages, in conjunction with the wide availability of translation and editing tools, has led to the need to develop effective tools for detecting cross-lingual plagiarism. Given a suspicious document, cross-lingual plagiarism detection comprises two main subtasks: retrieving documents that are candidate sources for that document and analysing those candidates one by one to determine their similarity to the suspicious document. In this article, we examine the second subtask, also called the detailed analysis subtask, where the goal is to align plagiarised fragments from source and suspicious documents in different languages. Our proposed approach has two main steps: the first step tries to find candidate plagiarised fragments and focuses on high recall, followed by a more precise similarity analysis based on dynamic text alignment that will filter the results by finding alignments between the identified fragments. With these two steps, the proximity of the terms will be considered in different levels of granularity. In both steps, our approach uses a dictionary to obtain translations of individual terms instead of using a machine translation system to convert longer passages from one language to another. We used a weighting scheme to distinct multiple translations of the terms. Experimental results show that our method outperforms the methods used by the systems that achieved the best results in the PAN-2012 and PAN-2014 competitions.

[1]  C. J. van Rijsbergen,et al.  Probabilistic models of information retrieval based on measuring the divergence from randomness , 2002, TOIS.

[2]  Benno Stein,et al.  Cross-language plagiarism detection , 2011, Lang. Resour. Evaluation.

[3]  Parth Gupta,et al.  Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language , 2016, Knowl. Based Syst..

[4]  Alberto Barrón-Cedeño,et al.  A statistical approach to crosslingual natural language tasks , 2008, LA-NMR.

[5]  James Mayfield,et al.  Character N-Gram Tokenization for European Language Text Retrieval , 2004, Information Retrieval.

[6]  Alberto Barrón-Cedeño,et al.  Methods for cross-language plagiarism detection , 2013, Knowl. Based Syst..

[7]  Philipp Cimiano,et al.  Exploiting Wikipedia for cross-lingual and multilingual information retrieval , 2012, Data Knowl. Eng..

[8]  Juan D. Velásquez,et al.  Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style , 2013, Expert Syst. Appl..

[9]  Paolo Rosso,et al.  A systematic study of knowledge graph analysis for cross-language plagiarism detection , 2016, Inf. Process. Manag..

[10]  Mark Stevenson,et al.  Developing a corpus of plagiarised short answers , 2011, Lang. Resour. Evaluation.

[11]  Azadeh Shakery,et al.  Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information , 2016, Inf. Process. Manag..

[12]  Efstathios Stamatatos,et al.  Plagiarism detection using stopword n-grams , 2011, J. Assoc. Inf. Sci. Technol..

[13]  Benno Stein,et al.  A Wikipedia-Based Multilingual Retrieval Model , 2008, ECIR.

[14]  Holger Schwenk,et al.  Continuous space language models , 2007, Comput. Speech Lang..

[15]  Renata de Matos Galante,et al.  A New Approach for Cross-Language Plagiarism Analysis , 2010, CLEF.

[16]  Roberto Navigli,et al.  Entity Linking meets Word Sense Disambiguation: a Unified Approach , 2014, TACL.