Detecting plagiarism in the forensic linguistics turn

This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the study. Written texts were obtained from two Portuguese universities and from a Portuguese newspaper. These texts are analysed linguistically to identify instances of verbatim, morpho-syntactic, lexical and discursive overlap. Survey data were obtained from two higher education institutions in Portugal and another two in the United Kingdom. These data are analysed using a 2 × 2 between-groups univariate analysis of variance (ANOVA) to reveal cross-cultural divergences in perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism or, conversely, to rejecting punishment. The research adopts a critical approach to plagiarism detection. On the one hand, it describes the linguistic strategies that plagiarists adopt when borrowing from other sources; on the other hand, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of the study is whether plagiarism involves an intention to deceive and, if so, whether forensic linguistic evidence can provide clues to this intentionality. The study also evaluates current computational approaches to plagiarism detection and identifies strategies that these systems fail to detect. Specifically, a method is proposed to detect translingual plagiarism. The findings indicate that, although cross-cultural aspects influence perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues to the plagiarist's intentionality. Furthermore, the findings show that translingual plagiarism can be detected using the proposed method, and that plagiarism detection software can be improved using existing computer tools.
