On the mono- and cross-language detection of text reuse and plagiarism

Plagiarism, the unacknowledged reuse of text, has increased in recent years due to the large amount of text readily available. For instance, recent studies claim that a high proportion of student reports now include plagiarism, making manual plagiarism detection practically infeasible. Automatic plagiarism detection tools assist experts in analysing documents for plagiarism. Nevertheless, the lack of standard collections with cases of plagiarism has prevented an accurate comparison of models, making the differences between them hard to appreciate. Seminal efforts on the detection of text reuse [2] have fostered the composition of standard resources for the accurate evaluation and comparison of methods. The aim of this PhD thesis is to address three of the main problems in the development of better models for automatic plagiarism detection: (i) the adequate identification of good potential sources for a given suspicious text; (ii) the detection of plagiarism despite modifications, such as word substitution and paraphrasing (special stress is given to cross-language plagiarism); and (iii) the generation of standard collections of cases of plagiarism and text reuse in order to provide a framework for the accurate comparison of models. Regarding difficulties (i) and (ii), we have carried out preliminary experiments over the METER corpus [2]. Given a suspicious document dq and a collection of potential source documents D, the process is divided into two steps. First, a small subset of potential source documents D* in D is retrieved. The documents d in D* are the most related to dq and, therefore, the most likely to include the source of the plagiarised fragments in it. We performed this stage on the basis of the Kullback-Leibler distance, computed over a subsample of the documents' vocabularies. Afterwards, a detailed analysis is carried out comparing dq to every d in D* in order to identify potential cases of plagiarism and their sources.
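The first (retrieval) step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the symmetric form of the Kullback-Leibler distance, the whitespace tokenisation, the back-off mass eps for unseen terms, the choice of the most frequent terms of dq as the vocabulary subsample, and all function and parameter names (term_probs, kl_distance, retrieve_candidates, k, vocab_size) are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_probs(text):
    """Relative frequencies of lowercased, whitespace-separated tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def kl_distance(p, q, vocab, eps=1e-6):
    """Symmetric Kullback-Leibler distance restricted to the terms in `vocab`.

    Terms missing from a distribution receive the small back-off mass `eps`
    (an assumption; the distributions are not renormalised afterwards).
    """
    distance = 0.0
    for term in vocab:
        pt = p.get(term, eps)
        qt = q.get(term, eps)
        distance += (pt - qt) * math.log(pt / qt)
    return distance


def retrieve_candidates(dq, D, k=2, vocab_size=50):
    """Return the names of the k documents in D closest to dq.

    D maps a document name to its text. The vocabulary subsample is taken,
    hypothetically, as the `vocab_size` most frequent terms of dq.
    """
    p = term_probs(dq)
    vocab = sorted(p, key=p.get, reverse=True)[:vocab_size]
    scored = sorted(
        (kl_distance(p, term_probs(text), vocab), name)
        for name, text in D.items()
    )
    return [name for _, name in scored[:k]]
```

Calling retrieve_candidates(dq, D, k) on a dictionary of candidate sources yields the reduced set D* that is handed to the detailed analysis; a smaller distance means a more closely related document.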
The detailed comparison was made on the basis of word n-grams, with n in {2, 3}. These n-gram levels are flexible enough to properly retrieve plagiarised fragments and their sources despite modifications [1]. The result is offered to the user, who takes the final decision. Further experiments were done in both stages in order to compare other similarity measures, such as the cosine measure, the Jaccard coefficient, and diverse fingerprinting and probabilistic models. One of the main weaknesses of currently available models is that they are unable to detect cross-language plagiarism. Approaching the detection of this kind of plagiarism is of high relevance, as most of the information published is written in English, and authors in other languages may find it attractive to make use of direct translations. Our experiments, carried out over parallel and comparable corpora, show that models of "standard" cross-language information retrieval are not enough. In fact, if the analysed source and target languages are related in some way (common linguistic ancestors or technical vocabulary), a simple comparison based on character n-grams seems to be the option of choice. However, in those cases where the relation between the languages involved is weaker, other models, such as those based on statistical machine translation, are necessary [3]. We plan to perform further experiments, mainly to approach the detection of cross-language plagiarism. In order to do that, we will use the corpora developed under the framework of the PAN competition on plagiarism detection (cf. PAN@CLEF: http://pan.webis.de). Models that consider cross-language thesauri and comparison of cognates will also be applied.
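The two n-gram comparisons mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the models evaluated in the thesis: the containment score (share of the suspicious fragment's word n-grams that also occur in the source), the use of the Jaccard coefficient over character trigrams, the regular-expression tokenisation, and every function name are hypothetical choices.

```python
import re


def word_ngrams(text, n):
    """Set of word n-grams (as tuples) of the lowercased text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(fragment, source, n=2):
    """Fraction of the fragment's word n-grams that also appear in the source."""
    frag = word_ngrams(fragment, n)
    if not frag:
        return 0.0
    return len(frag & word_ngrams(source, n)) / len(frag)


def char_ngrams(text, n=3):
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", text.lower()).strip()
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient between two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For the monolingual case, containment(fragment, source, n) would be evaluated with n = 2 and n = 3 for each fragment of dq against every d in D*. For related language pairs, jaccard(char_ngrams(x), char_ngrams(y)) gives a non-zero score to cognates such as "information" and "información" without any translation step, which is the intuition behind the character n-gram comparison.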

[1]  Zhang Ling,et al.  A Cluster-Based Plagiarism Detection Method - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[2]  Paul Newman,et al.  Forensic linguistics: An introduction to language, crime and the law (review) , 2008 .

[3]  Hector Garcia-Molina,et al.  Copy detection mechanisms for digital documents , 1995, SIGMOD '95.

[4]  Mehdi Mohammadi,et al.  Building Bilingual Parallel Corpora Based on Wikipedia , 2010, 2010 Second International Conference on Computer Engineering and Applications.

[5]  Eiichiro Sumita,et al.  Method for Building Sentence-Aligned Corpus from Wikipedia , 2008 .

[6]  Navot Akiva Using Clustering to Identify Outlier Chunks of Text - Notebook for PAN at CLEF 2011 , 2011, CLEF.

[7]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[8]  Alexander M. Fraser,et al.  Improved Machine Translation Performance via Parallel Sentence Extraction from Comparable Corpora , 2004, NAACL.

[9]  David Sharp,et al.  Technical Review of Plagiarism Detection Software Report , 2001 .

[10]  Vasudeva Varma,et al.  Cross Lingual Text Reuse Detection Based on Keyphrase Extraction and Similarity Measures , 2011, FIRE.

[11]  Gregory L. Anderson Cyberplagiarism: A look at the Web term paper sites , 1999 .

[12]  Krishnendu Chatterjee,et al.  Assigning trust to Wikipedia content , 2008, Int. Sym. Wikis.

[13]  Lluís Padró,et al.  FreeLing 1.3: Syntactic and semantic services in an open-source NLP library , 2006, LREC.

[14]  Gabriela Topa Cantisano,et al.  Plagio y otras prácticas académicamente incorrectas entre el alumnado universitario de nuevo ingreso: [Póster] , 2011 .

[15]  George Karypis,et al.  Hierarchical Clustering Algorithms for Document Datasets , 2005, Data Mining and Knowledge Discovery.

[16]  Shanmugasundaram Hariharan,et al.  A Comparison of Similarity Measures for Text Documents , 2008, J. Inf. Knowl. Manag..

[17]  Brigitte Bigi,et al.  Using Kullback-Leibler Distance for Text Categorization , 2003, ECIR.

[18]  Stephan Vogel,et al.  Parallel Implementations of Word Alignment Tool , 2008, SETQALNLP.

[19]  Sivaji Bandyopadhyay,et al.  Rule Based Plagiarism Detection using Information Retrieval - Notebook for PAN at CLEF 2011 , 2011, CLEF.

[20]  Marta R. Costa-jussà,et al.  Plagiarism Detection Using Information Retrieval and Similarity Measures Based on Image Processing Techniques - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[21]  Deborah L. McGuinness,et al.  Mining Revision History to Assess Trustworthiness of Article Fragments , 2006, 2006 International Conference on Collaborative Computing: Networking, Applications and Worksharing.

[22]  Mitsuo Shimohata,et al.  Acquiring Paraphrases from Corpora and Its Application to Machine Translation , 2004 .

[23]  Paolo Rosso,et al.  Towards Document Plagiarism Detection Based on the Relevance and Fragmentation of the Reused Text , 2010, MICAI.

[24]  Janis Grundspenkis,et al.  Computer-based plagiarism detection methods and tools: an overview , 2007, CompSysTech '07.

[25]  Federico Gaspari,et al.  Detecting Inappropriate Use of Free Online Machine Translation by Language Students. A Special Case of Plagiarism Detection , 2006, EAMT.

[26]  Mark Stevenson,et al.  External Plagiarism Detection using Information Retrieval and Sequence Alignment - Notebook for PAN at CLEF 2011 , 2011, CLEF.

[27]  Roman Kern,et al.  External and Intrinsic Plagiarism Detection Using Vector Space Models , 2009 .

[28]  Chris Quirk,et al.  Monolingual Machine Translation for Paraphrase Generation , 2004, EMNLP.

[29]  K. J. Ottenstein An algorithmic approach to the detection and prevention of plagiarism , 1976, SGCS.

[30]  Huaiyu Zhu On Information and Sufficiency , 1997 .

[31]  José Carlos González,et al.  A Plagiarism Detector for Intrinsic Plagiarism - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[32]  Benno Stein,et al.  Automatic Vandalism Detection in Wikipedia , 2008, ECIR.

[33]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[34]  Patrick Schone,et al.  Mining Wiki Resources for Multilingual Named Entity Recognition , 2008, ACL.

[35]  Michael J. Wise,et al.  Running Karp-Rabin Matching and Greedy String Tiling , 2003 .

[36]  Robert Dale,et al.  United Nations General Assembly Resolutions : a six-language parallel corpus , 2009 .

[37]  Pascale Fung,et al.  Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and E , 2004, EMNLP.

[38]  M. Kendall The Statistical Study of Literary Vocabulary , 1944, Nature.

[39]  Michael Skinner,et al.  Information arbitrage across multi-lingual Wikipedia , 2009, WSDM '09.

[40]  Eduard Hovy,et al.  Learning paraphrases from text , 2009 .

[41]  Sven Hartrumpf,et al.  Semantic Duplicate Identification with Parsing and Machine Learning , 2010, TSD.

[42]  Roman Kern,et al.  External and Intrinsic Plagiarism Detection Using a Cross-Lingual Retrieval and Segmentation System - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[43]  Mark Dras,et al.  Tree adjoining grammar and the reluctant paraphrasing of text , 1999 .

[44]  Moshe Koppel,et al.  Translationese and Its Dialects , 2011, ACL.

[45]  Simone Santini,et al.  Similarity Measures , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[46]  Chris Callison-Burch,et al.  Paraphrasing and translation , 2007 .

[47]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[48]  Benno Stein,et al.  Intrinsic Plagiarism Analysis with Meta Learning , 2007, PAN.

[49]  Shlomo Argamon,et al.  Overview of the International Authorship Identification Competition at PAN-2011 , 2011, CLEF.

[50]  H. Sichel On a Distribution Law for Word Frequencies , 1975 .

[51]  Karl O. Jones,et al.  Cyber Cheating in an Information Technology Age , 2008 .

[52]  Bruno Pouliquen,et al.  Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC , 2002, CICLing.

[53]  Kenneth J. Chapman,et al.  Academic dishonesty in a global educational market: a comparison of Hong Kong and American university business students , 2004 .

[54]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[55]  Atsushi Fujita,et al.  Automatic Generation of Syntactically Well-formed and Semantically Appropriate Paraphrases , 2005 .

[56]  John Gibbons,et al.  Dimensions of forensic linguistics , 2008 .

[57]  Ulrik Brandes,et al.  Revision and Co-revision in Wikipedia : Detecting Clusters of Interest , 2007 .

[58]  Benno Stein,et al.  Corpus and Evaluation Measures for Automatic Plagiarism Detection , 2010, LREC.

[59]  Marta Recasens,et al.  On Paraphrase and Coreference , 2010, Computational Linguistics.

[60]  M. Coulthard Author Identification, Idiolect, and Linguistic Uniqueness. , 2004 .

[61]  Alberto Barrón-Cedeño,et al.  PAN@FIRE: Overview of the Cross-Language !ndian Text Re-Use Detection Competition , 2011, FIRE.

[62]  Maria Teresa Turell La tasca del lingüista detectiu en casos de detecció de plagi i determinació d'autoria de textos escrits , 2011 .

[63]  Gultekin Özsoyoglu,et al.  Evaluating Publication Similarity Measures , 2005, IEEE Data Eng. Bull..

[64]  Parvati Iyer,et al.  Document Similarity Analysis for a Plagiarism Detection System , 2005, IICAI.

[65]  Alberto Barrón-Cedeño,et al.  Detecting source code reuse across programming languages , 2011 .

[66]  W. Bruce Croft,et al.  Finding text reuse on the web , 2009, WSDM '09.

[67]  Péter Gács,et al.  Information Distance , 1998, IEEE Trans. Inf. Theory.

[68]  C. J. Morgan,et al.  Student cheating: an ethical dilemma , 1992, Proceedings. Twenty-Second Annual conference Frontiers in Education.

[69]  Robert L. Mercer,et al.  But Dictionaries Are Data Too , 1993, HLT.

[70]  Maria Leonor Pacheco,et al.  of the Association for Computational Linguistics: , 2001 .

[71]  Martin Potthast,et al.  Overview of the 1st International Competition on Wikipedia Vandalism Detection , 2010, CLEF.

[72]  Silvia Bernardini,et al.  A New Approach to the Study of Translationese: Machine-learning the Difference between Original and Translated Text , 2005, Lit. Linguistic Comput..

[73]  Walter Daelemans,et al.  Intrinsic Plagiarism Detection Using Character Trigram Distance Scores - Notebook for PAN at CLEF 2011 , 2011, CLEF.

[74]  Naomie Salim,et al.  Web Based Cross Language Plagiarism Detection , 2010, 2010 Second International Conference on Computational Intelligence, Modelling and Simulation.

[75]  R. Comas Forgas,et al.  ACADEMIC CYBERPLAGIARISM: A DESCRIPTIVE AND COMPARATIVE ANALYSIS OF THE PREVALENCE AMONGST THE UNDERGRADUATE STUDENTS AT TECMILENIO UNIVERSITY (MEXICO) AND BALEARIC ISLANDS UNIVERSITY (SPAIN) , 2010 .

[76]  Máté Pataki Distributed similarity and plagiarism search , 2006 .

[77]  Benno Stein,et al.  Plagiarism Detection Without Reference Collections , 2006, GfKl.

[78]  Enrique Vallés Balaguer Putting Ourselves in SME’s Shoes: Automatic Detection of Plagiarism by the WCopyFind tool , 2009 .

[79]  Marta Vila,et al.  Detección automática de plagio: de la copia exacta a la paráfrasis , 2010 .

[80]  Karen Spärck Jones A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.

[81]  Rynson W. H. Lau,et al.  CHECK: a document plagiarism detection system , 1997, SAC '97.

[82]  Daniel Shawcross Wilkerson,et al.  Winnowing: local algorithms for document fingerprinting , 2003, SIGMOD '03.

[83]  Hermann A. Maurer,et al.  Plagiarism - A Survey , 2006, J. Univers. Comput. Sci..

[84]  Alberto Barrón-Cedeño,et al.  Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance , 2009, CICLing.

[85]  Kenneth Ward Church,et al.  Dotplot : a program for exploring self-similarity in millions of lines of text and code , 1993 .

[86]  冯占省 法律语言学研究具有明显的司法实践性——解读An Introduction to Forensic Linguistics:Language in Evidence , 2010 .

[87]  Paul Clough,et al.  Old and new challenges in automatic plagiarism detection , 2003 .

[88]  Alberto Barrón-Cedeño,et al.  Towards the 2nd International Competition on Plagiarism Detection and Beyond , 2010 .

[89]  Mirna Adriani,et al.  Automatic external plagiarism detection using passage similarities , 2010 .

[90]  Mikel L. Forcada,et al.  An Open-Source Shallow-Transfer Machine Translation Toolbox: Consequences of Its Release and Availability , 2005, MTSUMMIT.

[91]  Alberto Barrón-Cedeño,et al.  A statistical approach to crosslingual natural language tasks , 2008, LA-NMR.

[92]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[93]  Robert Krovetz,et al.  Viewing morphology as an inference process , 1993, Artif. Intell..

[94]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[95]  L. R. Dice Measures of the Amount of Ecologic Association Between Species , 1945 .

[96]  Alberto Barrón-Cedeño,et al.  Towards the Detection of Cross-Language Source Code Reuse , 2011, NLDB.

[97]  Hector Garcia-Molina,et al.  SCAM: A Copy Detection Mechanism for Digital Documents , 1995, DL.

[98]  Alberto Barrón-Cedeño,et al.  On Cross-lingual Plagiarism Analysis using a Statistical Model , 2008, PAN.

[99]  András Kornai,et al.  Mathematical Linguistics , 2007, Advanced Information and Knowledge Processing.

[100]  Christian S. Collberg,et al.  Self-plagiarism in computer science , 2005, CACM.

[101]  Elif Yamangil,et al.  Mining Wikipedia's Article Revision History for Training Computational Linguistics Algorithms , 2008 .

[102]  F. Taylor,et al.  Cryptomnesia and Plagiarism , 1965, British Journal of Psychiatry.

[103]  Tomaz Erjavec,et al.  The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages , 2006, LREC.

[104]  L. Talmy Lexicalisation patterns: semantic structure in lexical forms , 1985 .

[105]  Neil Cooke,et al.  A High-performance Plagiarism Detection System - Notebook for PAN at CLEF 2011 , 2011, CLEF.

[106]  James A. Malcolm,et al.  A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector , 2004 .

[107]  Fintan Culwin,et al.  Classifications of plagiarism detection engines , 2005 .

[108]  Cristian Grozea,et al.  Who's the Thief? Automatic Detection of the Direction of Plagiarism , 2010, CICLing.

[109]  Naomie Salim,et al.  Fuzzy Semantic-Based String Similarity for Extrinsic Plagiarism Detection - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[110]  Lucia Specia,et al.  Using Natural Language Processing for Automatic Detection of Plagiarism , 2010 .

[111]  Peter W. Culicover,et al.  Paraphrase generation and information retrieval from stored text , 1968, Mech. Transl. Comput. Linguistics.

[112]  梶原 寿,et al.  Martin Luther King,Jr.における宗教 , 1969 .

[113]  Pamela Samuelson,et al.  Self-plagiarism or fair use , 1994, CACM.

[114]  Alison J. Head,et al.  How Today's College Students use Wikipedia for Course-related Research , 2010, First Monday.

[115]  Rubén Comas-Forgas,et al.  Academic plagiarism prevalence among Spanish undergraduate students: an exploratory analysis , 2010 .

[116]  Alberto Barrón-Cedeño,et al.  English-Spanish Large Statistical Dictionary of Inflectional Forms , 2010, LREC.

[117]  Nello Cristianini,et al.  Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis , 2002, NIPS.

[118]  Dekai Wu,et al.  An Evaluation of MT Alignment Baseline Approaches upon Cross-Lingual Plagiarism Detection , 2011 .

[119]  Jöran Beel,et al.  Comparative evaluation of text- and citation-based plagiarism detection approaches using guttenplag , 2011, JCDL '11.

[120]  Paolo Rosso,et al.  Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance , 2009, CICLing.

[121]  Casey Keck,et al.  The use of paraphrase in summary writing: A comparison of L1 and L2 writers , 2006 .

[122]  Sergey Butakov,et al.  Using Microsoft SQL Server platform for plagiarism detection , 2009 .

[123]  G. Diekhoff,et al.  College cheating: Immaturity, lack of commitment, and the neutralizing attitude , 1986 .

[124]  Kenneth Ward Church,et al.  A Program for Aligning Sentences in Bilingual Corpora , 1993, CL.

[125]  Maarten de Rijke,et al.  Finding Similar Sentences across Multiple Languages in Wikipedia , 2006 .

[126]  Fintan Culwin A Longitudinal Study of Nonoriginal Content in Final-Year Computing Undergraduate Projects , 2008, IEEE Transactions on Education.

[127]  Paul M. B. Vitányi,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 1993, Graduate Texts in Computer Science.

[128]  Andreas Eisele,et al.  Improving Machine Translation Performance Using Comparable Corpora , 2010 .

[129]  Thomas Gottron External Plagiarism Detection Based on Standard IR Technology and Fast Recognition of Common Subsequences - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[130]  Andrew Lih,et al.  Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluating collaborative media as a news resource , 2004 .

[131]  Whatsisname Devil's Dictionary , 1958 .

[132]  Prasenjit Majumder,et al.  External & Intrinsic Plagiarism Detection: VSM & Discourse Markers based Approach - Notebook for PAN at CLEF 2011 , 2011, CLEF.

[133]  Mark Stevenson,et al.  University of Sheffield - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[134]  Malek Boualem,et al.  Query translation using Wikipedia-based resources for analysis and disambiguation , 2010, EAMT.

[135]  Alex Acero,et al.  Spoken Language Processing , 2001 .

[136]  Ronald L. Rivest,et al.  The MD5 Message-Digest Algorithm , 1992, RFC.

[137]  Parth Gupta,et al.  Mapping Hindi-English Text Re-use Document Pairs , 2011, FIRE.

[138]  Carles Blanch-Mur,et al.  Nivel de conducta académica deshonesta entre los estudiantes de una escuela de ciencias de la salud , 2006 .

[139]  Robert J. Gaizauskas,et al.  Building and annotating a corpus for the study of journalistic text reuse , 2002, LREC.

[140]  Alberto Barrón-Cedeño,et al.  Monolingual and Crosslingual Plagiarism Detection Towards the Competition @ SEPLN 09 , 2009 .

[141]  H. Redkey,et al.  A new approach. , 1967, Rehabilitation record.

[142]  Santiago Cavanillas Cyberplagiarism in University Regulations , 2008 .

[143]  Benno Stein,et al.  Netspeak - Assisting Writers in Choosing Words , 2010, ECIR.

[144]  Mary Jane Irwin,et al.  Plagiarism on the rise , 2006, CACM.

[145]  Benno Stein,et al.  Fourth international workshop on uncovering plagiarism, authorship, and social software misuse , 2011, SIGF.

[146]  Alberto Barrón-Cedeño,et al.  Word Length n-Grams for Text Re-use Detection , 2010, CICLing.

[147]  Iryna Gurevych,et al.  Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary , 2008, LREC.

[148]  Tommy W. S. Chow,et al.  A coarse-to-fine framework to efficiently thwart plagiarism , 2011, Pattern Recognit..

[149]  David Buttler,et al.  A Short Survey of Document Structure Similarity Algorithms , 2004, International Conference on Internet Computing.

[150]  Alberto Barrón-Cedeño,et al.  On Automatic Plagiarism Detection Based on n-Grams Comparison , 2009, ECIR.

[151]  Donald B. Rubin,et al.  Maximum Likelihood from Incomplete Data , 1972 .

[152]  Paolo Rosso,et al.  On the relevance of search space reduction in automatic plagiarism detection , 2009 .

[153]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[154]  Paolo Rosso,et al.  Clustering Abstracts of Scientific Texts Using the Transition Point Technique , 2006, CICLing.

[155]  Kathleen R. McKeown,et al.  Information fusion for multidocument summarization: paraphrasing and generation , 2003 .

[156]  Máté Pataki,et al.  Comparison of Overlap Detection Techniques , 2002, International Conference on Computational Science.

[157]  Mark Gerstein,et al.  Data Mining on the Web , 2006, Science.

[158]  Benno Stein,et al.  Cross-language plagiarism detection , 2011, Lang. Resour. Evaluation.

[159]  Alberto Barrón-Cedeño,et al.  A Comparison of Models over Wikipedia Articles Revisions , 2009 .

[160]  Paolo Rosso,et al.  Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features , 2011, CICLing.

[161]  Robert L. Mercer,et al.  Aligning Sentences in Parallel Corpora , 1991, ACL.

[162]  Peter C. R. Lane,et al.  Tackling the PAN’09 External Plagiarism Detection Corpus with a Desktop Plagiarism Detector , 2009 .

[163]  Benno Stein,et al.  Intrinsic Plagiarism Detection , 2006, ECIR.

[164]  Regina Barzilay,et al.  Information Fusion in the Context of Multi-Document Summarization , 1999, ACL.

[165]  Gail Wood,et al.  Academic original sin: Plagiarism, the Internet, and librarians , 2004 .

[166]  Jan Kasprzak,et al.  Finding Plagiarism by Evaluating Document Similarities , 2009 .

[167]  Efstathios Stamatatos,et al.  A survey of modern authorship attribution methods , 2009, J. Assoc. Inf. Sci. Technol..

[168]  Yorick Wilks,et al.  Measuring Text Reuse , 2002, ACL.

[169]  Sally B. Mitchell,et al.  Encyclopedia of Forensic Science , 2004 .

[170]  Justin Zobel,et al.  A Scalable System for Identifying Co-derivative Documents , 2004, SPIRE.

[171]  R. P. Fishburne,et al.  Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel , 1975 .

[172]  Martin Gellerstam,et al.  Translationese in Swedish novels translated from English , 1986 .

[173]  Rada Mihalcea,et al.  Using Wikipedia for Automatic Word Sense Disambiguation , 2007, NAACL.

[174]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.

[175]  Benno Stein,et al.  An Evaluation Framework for Plagiarism Detection , 2010, COLING.

[176]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[177]  Hwan-Gue Cho,et al.  Detecting and tracing plagiarized documents by reconstruction plagiarism-evolution tree , 2008, 2008 8th IEEE International Conference on Computer and Information Technology.

[178]  Jaume Sureda Negre,et al.  PRÁCTICAS DE CITACIÓN Y PLAGIO ACADÉMICO EN LA ELABORACIÓN TEXTUAL DEL ALUMNADO UNIVERSITARIO , 2011 .

[179]  Rubén Comas,et al.  Academic Cyberplagiarism: Tracing the causes to reach solutions , 2008 .

[180]  Hsin-Chang Yang,et al.  A Platform Framework for Cross-Lingual Text Relatedness Evaluation and Plagiarism Detection , 2008, 2008 3rd International Conference on Innovative Computing Information and Control.

[181]  Bruno Pouliquen,et al.  Automatic Identification of Document Translations in Large Multilingual Document Collections , 2006, ArXiv.

[182]  Benno Stein,et al.  Strategies for retrieving plagiarized documents , 2007, SIGIR.

[183]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[184]  Michel Simard,et al.  Using cognates to align sentences in bilingual corpora , 1993, TMI.

[185]  Chris Brockett,et al.  Automatically Constructing a Corpus of Sentential Paraphrases , 2005, IJCNLP.

[186]  Ortega Soto,et al.  Wikipedia: A quantitative analysis , 2012 .

[187]  Efstathios Stamatatos Plagiarism detection based on structural information , 2011, CIKM '11.

[188]  Ming Li,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 2019, Texts in Computer Science.

[189]  Benno Stein,et al.  Intrinsic plagiarism analysis , 2011, Lang. Resour. Evaluation.

[190]  Houda Bouamor,et al.  Local modifications and paraphrases in Wikipedia's revision history , 2011, Proces. del Leng. Natural.

[191]  Paul Clough,et al.  Plagiarism in natural and programming languages: an overview of current tools and technologies , 2000 .

[192]  Philippe Langlais,et al.  Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia. , 2011, BUCC@ACL.

[193]  Rudolf Ravas,et al.  Improved Implementation for Finding Text Similarities in Large Sets of Data - Notebook for PAN at CLEF 2011 , 2011, CLEF.

[194]  Efstathios Stamatatos,et al.  Intrinsic Plagiarism Detection Using Character n-gram Profiles , 2009 .

[195]  Per Runeson,et al.  Detection of Duplicate Defect Reports Using Natural Language Processing , 2007, 29th International Conference on Software Engineering (ICSE'07).

[196]  Fernando Llopis,et al.  A Textual-Based Similarity Approach for Efficient and Scalable External Plagiarism Analysis - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[197]  Kenneth Ward Church Char_align: A Program for Aligning Parallel Texts at the Character Level , 1993, ACL.

[198]  Stefan Weber,et al.  Das Google-Copy-Paste-Syndrom , 2008 .

[199]  Horacio Rodríguez,et al.  Paraphrase Concept and Typology. A Linguistically Based and Computationally Oriented Approach , 2011, Proces. del Leng. Natural.

[200]  Steven David,et al.  Source Code Authorship Attribution , 2010 .

[201]  Fintan Culwin,et al.  A Visual Argument for Plagiarism Detection using Word Pairs , 2004 .

[202]  Regina Barzilay,et al.  Extracting Paraphrases from a Parallel Corpus , 2001, ACL.

[203]  Peter Jackson,et al.  Natural Language Processing of Online Applications , 2002 .

[204]  Paolo Rosso,et al.  Extracción de Corpus Paralelos de la Wikipedia basada en la Obtención de Alineamientos Bilingües a Nivel de Frase , 2011 .

[205]  Antonio Toral,et al.  A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia , 2006, Workshop On New Text Wikis And Blogs And Other Dynamic Text Sources.

[206]  Alberto Barrón-Cedeño,et al.  Plagiarism Detection across Distant Language Pairs , 2010, COLING.

[207]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[208]  Justin Zobel,et al.  Methods for Identifying Versioned and Plagiarized Documents , 2003, J. Assoc. Inf. Sci. Technol..

[209]  Hans Peter Luhn,et al.  The Automatic Creation of Literature Abstracts , 1958, IBM J. Res. Dev..

[210]  Chris J. Park,et al.  In Other (People's) Words: Plagiarism by university students--literature and lessons , 2003 .

[211]  R. Gunning The Technique of Clear Writing. , 1968 .

[212]  Primoz Skraba,et al.  Cross-lingual document similarity , 2012, Proceedings of the ITI 2012 34th International Conference on Information Technology Interfaces.

[213]  James Mayfield,et al.  Character N-Gram Tokenization for European Language Text Retrieval , 2004, Information Retrieval.

[214]  Jane C. Ginsburg,et al.  Copyright and Piracy: An Interdisciplinary Critique , 2010 .

[215]  James A. Malcolm,et al.  Detecting Short Passages of Similar Text in Large Document Collections , 2001, EMNLP.

[216]  Alexander F. Gelbukh,et al.  PPChecker: Plagiarism Pattern Checker in Document Copy Detection , 2006, TSD.

[217]  Wei-Guang Teng,et al.  Extending Web Search for Online Plagiarism Detection , 2007, 2007 IEEE International Conference on Information Reuse and Integration.

[218]  Flemming Topsøe,et al.  Jensen-Shannon divergence and Hilbert space embedding , 2004, International Symposium on Information Theory, 2004. ISIT 2004. Proceedings..

[219]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[220]  Alberto Barrón-Cedeño,et al.  Towards the Exploitation of Statistical Language Models for Plagiarism Detection with Reference , 2008, PAN.

[221]  Regina Barzilay,et al.  Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment , 2003, NAACL.

[222]  Ahmet Aker,et al.  Correlation between Similarity Measures for Inter-Language Linked Wikipedia Articles , 2012, LREC.

[223]  Yurii Palkovskii,et al.  Exploring Fingerprinting as External Plagiarism Detection Method - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[224]  Mirna Adriani,et al.  Automatic External Plagiarism Detection Using Passage Similarities - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[225]  Jakob Uszkoreit,et al.  Large Scale Parallel Document Mining for Machine Translation , 2010, COLING.

[226]  Richard M. Karp,et al.  Efficient Randomized Pattern-Matching Algorithms , 1987, IBM J. Res. Dev..

[227]  Stan Matwin,et al.  Intrinsic Plagiarism Detection using Complexity Analysis , 2009 .

[228]  Jan Kasprzak,et al.  Improving the Reliability of the Plagiarism Detection System - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[229]  Bonnie J. Dorr Semantic annotation and lexico-syntactic paraphrase , 2004 .

[230]  David J. Ketchen,et al.  THE APPLICATION OF CLUSTER ANALYSIS IN STRATEGIC MANAGEMENT RESEARCH: AN ANALYSIS AND CRITIQUE , 1996 .

[231]  Renata de Matos Galante,et al.  UFRGS@PAN2010: Detecting External Plagiarism - Lab Report for PAN at CLEF 2010 , 2010, CLEF.

[232]  Jill Burstein,et al.  Opportunities for Natural Language Processing Research in Education , 2009, CICLing.

[233]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[234]  Emanuele Caglioti,et al.  A plagiarism detection procedure in three steps: Selection, matches and squares , 2009 .

[235]  Liwen Vaughan,et al.  Statistical methods for the information professional , 2001 .

[236]  James H. Martin,et al.  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition , 2000 .

[237]  John D. Lafferty,et al.  Information retrieval as statistical translation , 1999, SIGIR '99.

[238]  Alexander Lindey,et al.  Plagiarism and originality , 1974 .

[239]  Alberto Barrón-Cedeño,et al.  DeSoCoRe: Detecting Source Code Re-Use across Programming Languages , 2012, HLT-NAACL.

[240]  Johannes Gehrke,et al.  Plagiarism Detection in arXiv , 2006, Sixth International Conference on Data Mining (ICDM'06).

[241]  Departmento de Sistemas Informaticos,et al.  On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism , 2012 .

[242]  Benno Stein,et al.  Applying Hash-based Indexing in Text-based Information Retrieval , 2007 .

[243]  Yurii Palkovskii,et al.  Using WordNet-based Semantic Similarity Measurement in External Plagiarism Detection - Notebook for PAN at CLEF 2011 , 2011, CLEF.

[244]  Bilal Zaka Empowering Plagiarism Detection with a Web Services Enabled Collaborative Network , 2009 .

[245]  D. Holmes A Stylometric Analysis of Mormon Scripture and Related Texts , 1992 .

[246]  Yaakov HaCohen-Kerner,et al.  Detection of Simple Plagiarism in Computer Science Papers , 2010, COLING.