A study on plagiarism detection and plagiarism direction identification using natural language processing techniques

Man Yan Miranda Chong

A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy, 2013
