COUNTER: corpus of Urdu news text reuse

Text reuse is the act of borrowing text from existing documents to create new texts. Large online repositories that are freely available and easily accessible are making text reuse both more common and harder to detect. A major hindrance to developing and evaluating mono-lingual text reuse detection methods, especially for South Asian languages, is the lack of standardized benchmark corpora. Among other things, a gold standard corpus enables researchers to compare state-of-the-art methods directly. In this study, we address this gap by developing a benchmark corpus for Urdu, a widely spoken but under-resourced language. The COrpus of Urdu News TExt Reuse (COUNTER) contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at the document level with three levels of reuse: wholly derived, partially derived and non-derived. We also apply a number of similarity estimation methods to the corpus to show how it can be used for the development, evaluation and comparison of text reuse detection systems for Urdu. The corpus is a vital resource for developing and evaluating text reuse detection systems in general, and for Urdu in particular.
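
The abstract does not name the similarity estimation methods that were applied. As one illustration of how a document-level score could be mapped onto the three reuse levels, the sketch below uses word n-gram containment, a common baseline in text reuse detection. The whitespace tokenization, the function names, and both thresholds are assumptions made for this example; they are not taken from the paper.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in a text (whitespace tokenization, an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(derived, source, n=1):
    """Fraction of the derived text's n-grams that also occur in the source text."""
    derived_grams = word_ngrams(derived, n)
    if not derived_grams:
        return 0.0
    shared = derived_grams & word_ngrams(source, n)
    return len(shared) / len(derived_grams)


def label_reuse(score, partial_threshold=0.3, whole_threshold=0.7):
    """Map a similarity score to a reuse level; both thresholds are hypothetical."""
    if score >= whole_threshold:
        return "wholly derived"
    if score >= partial_threshold:
        return "partially derived"
    return "non-derived"
```

In practice, thresholds like these would be tuned on the annotated documents rather than fixed by hand, and Urdu text would likely need a more careful tokenizer than `str.split`.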
