Design and Development of a Large Cross-Lingual Plagiarism Corpus for Urdu-English Language Pair

Cross-lingual plagiarism occurs when the source (or original) text is in one language and the plagiarized text is in another. In recent years, cross-lingual plagiarism detection has attracted the attention of the research community for two reasons: a large amount of digital text is easily accessible in many languages through online repositories, and machine translation systems are readily available. Together these make cross-lingual plagiarism easier to commit and harder to detect. Developing and evaluating cross-lingual plagiarism detection systems requires standard evaluation resources. Most earlier studies have developed cross-lingual plagiarism corpora for English paired with other European languages. For the Urdu-English language pair, however, the problem has not been thoroughly explored, although a large amount of digital text is readily available in Urdu and the language is spoken in many countries (particularly Pakistan, India, and Bangladesh). To fill this gap, this paper presents a large benchmark cross-lingual corpus for the Urdu-English language pair. The proposed corpus contains 2,395 source-suspicious document pairs (540 automatic translation, 539 artificially paraphrased, 508 manually paraphrased, and 808 nonplagiarized). Furthermore, the corpus contains three types of cross-lingual examples: artificial (automatic translation and artificially paraphrased), simulated (manually paraphrased), and real (nonplagiarized), a combination that has not previously been reported in the development of cross-lingual corpora. A detailed analysis of the corpus was carried out using n-gram overlap and longest common subsequence approaches. Using word unigrams, mean similarity scores of 1.00, 0.68, 0.52, and 0.22 were obtained for automatic translation, artificially paraphrased, manually paraphrased, and nonplagiarized documents, respectively. These results show that the documents in the proposed corpus were created using different obfuscation techniques, which makes the dataset more realistic and challenging. We believe that the corpus developed in this study will help to foster research on Urdu, an under-resourced language, and will be useful in the development, comparison, and evaluation of cross-lingual plagiarism detection systems for the Urdu-English language pair. The proposed corpus is free and publicly available for research purposes.
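
The two similarity measures named above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes a containment-style n-gram overlap (shared n-grams divided by the n-grams of the suspicious text) and an LCS length normalized by the shorter text. It also assumes both texts have already been brought into the same language, for instance by machine-translating the Urdu side, and that whitespace tokenization is adequate. The function names are hypothetical.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious: str, source: str, n: int = 1) -> float:
    """Assumed overlap measure: share of the suspicious n-grams found in the source."""
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_norm(suspicious: str, source: str) -> float:
    """Assumed normalization: LCS length divided by the length of the shorter text."""
    a, b = suspicious.lower().split(), source.lower().split()
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    suspicious = "a quick brown fox leaps over a lazy dog"
    print(containment(suspicious, source, n=1))
    print(lcs_norm(suspicious, source))
```

Under these assumptions, a score near 1 indicates near-verbatim reuse, while lower scores indicate heavier rewriting, which matches the ordering of the mean unigram scores reported for the four document types.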

[1]  Horacio Rodríguez,et al.  Is This a Paraphrase? What Kind? Paraphrase Boundaries and Typology , 2014 .

[2]  Alberto Barrón-Cedeño,et al.  Methods for cross-language plagiarism detection , 2013, Knowl. Based Syst..

[3]  Benno Stein,et al.  Overview of the PAN/CLEF 2015 Evaluation Lab , 2015, CLEF.

[4]  Naomie Salim,et al.  Web Based Cross Language Plagiarism Detection , 2010, 2010 Second International Conference on Computational Intelligence, Modelling and Simulation.

[5]  Tomaz Erjavec,et al.  The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages , 2006, LREC.

[6]  Jayashree Nair,et al.  An efficient English to Hindi machine translation system using hybrid mechanism , 2016, 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI).

[7]  Alison J. Head,et al.  How Today's College Students use Wikipedia for Course-related Research , 2010, First Monday.

[8]  Ehsan Ullah Munir,et al.  Cross-Language Urdu-English (CLUE) Text Alignment Corpus: Notebook for PAN at CLEF 2015 , 2015, CLEF.

[9]  Udo Kruschwitz,et al.  Creating language resources for under-resourced languages: methodologies, and experiments with Arabic , 2015, Lang. Resour. Evaluation.

[10]  Benno Stein,et al.  Cross-language plagiarism detection , 2011, Lang. Resour. Evaluation.

[11]  Masnizah Mohd,et al.  Arabic-English Cross-language Plagiarism Detection using Winnowing Algorithm , 2014 .

[12]  Paul Rayson,et al.  COUNTER: corpus of Urdu news text reuse , 2017, Lang. Resour. Evaluation.

[13]  Mark Stevenson,et al.  Developing a corpus of plagiarised short answers , 2011, Lang. Resour. Evaluation.

[14]  Guy Judge Plagiarism: bringing economics and education together (with a little help from IT) , 2008 .

[15]  Vasudeva Varma,et al.  Cross Lingual Text Reuse Detection Based on Keyphrase Extraction and Similarity Measures , 2011, FIRE.

[16]  Paolo Rosso,et al.  A systematic study of knowledge graph analysis for cross-language plagiarism detection , 2016, Inf. Process. Manag..

[17]  Philipp Koehn,et al.  A parallel corpus for statistical machine translation , 2005 .

[18]  Sarmad Hussain Complexity of Asian Writing Systems : A Case Study of Nafees Nasta ’ leeq for Urdu , 2003 .

[19]  Paul Rayson,et al.  Measuring Short Text Reuse for the Urdu Language , 2018, IEEE Access.

[20]  Matthias Hagen,et al.  Source Retrieval for Plagiarism Detection from Large Web Corpora: Recent Approaches , 2015, CLEF.

[21]  Benno Stein,et al.  An Evaluation Framework for Plagiarism Detection , 2010, COLING.

[22]  Hsin-Chang Yang,et al.  A Platform Framework for Cross-Lingual Text Relatedness Evaluation and Plagiarism Detection , 2008, 2008 3rd International Conference on Innovative Computing Information and Control.

[23]  Ayu Purwarianti,et al.  The Construction of Indonesian-english Cross Language Plagiarism Detection System Using Fingerprinting Technique , 2012 .

[24]  Didier Schwab,et al.  A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection , 2016, LREC.

[25]  Alberto Barrón-Cedeño,et al.  PAN@FIRE: Overview of the Cross-Language !ndian Text Re-Use Detection Competition , 2011, FIRE.

[26]  Benno Stein,et al.  Intrinsic Plagiarism Analysis with Meta Learning , 2007, PAN.

[27]  Heshaam Faili,et al.  Developing Bilingual Plagiarism Detection Corpus Using Sentence Aligned Parallel Corpus: Notebook for PAN at CLEF 2015 , 2015, CLEF.

[28]  Kazuaki Kishida,et al.  Technical issues of cross-language information retrieval: a review , 2005, Inf. Process. Manag..

[29]  Rao Muhammad Adeel Nawab,et al.  Mono-lingual Paraphrased Text Reuse and Plagiarism Detection , 2012 .

[30]  and Quality,et al.  Assessing the Accuracy of Google Translate to Allow Data Extraction From Trials Published in Non-English Languages , 2013 .

[31]  Alberto Barrón-Cedeño,et al.  Plagiarism Detection across Distant Language Pairs , 2010, COLING.

[32]  Lalit Agarwal,et al.  Multilingual Plagiarism Detection , 2014 .

[33]  Chris J. Park,et al.  In Other (People's) Words: Plagiarism by university students--literature and lessons , 2003 .

[34]  Susan T. Dumais,et al.  Automatic Cross-Language Information Retrieval Using Latent Semantic Indexing , 1998 .

[35]  Donald L. Mccabe Cheating among college and university students: A North American perspective , 2005 .