Towards the Exploitation of Statistical Language Models for Plagiarism Detection with Reference

To plagiarise is to take credit for another person's work. In text, plagiarism means including fragments (or even an entire document) written by another author without giving the corresponding credit. In this work we describe our first attempt to detect plagiarised segments in a text by employing statistical Language Models (LMs) and perplexity. The preliminary experiments, carried out on two specialised and literary corpora (including original, part-of-speech and stemmed versions), show that the perplexity of a text segment, given a Language Model computed over an author's text, could be a relevant feature in plagiarism detection.
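To make the role of perplexity concrete, the following is a minimal sketch of how a segment could be scored against a language model built from an author's text. It is not the setup used in the experiments: the bigram order, the add-one smoothing, and the names `train_bigram_lm` and `perplexity` are all assumptions chosen for illustration. The quantity computed is the standard perplexity $PP(w_1 \dots w_N) = 2^{-\frac{1}{N}\sum_i \log_2 P(w_i \mid w_{i-1})}$.

```python
import math
from collections import Counter


def train_bigram_lm(tokens):
    """Collect bigram statistics over an author's tokenised text.

    Hypothetical helper: returns (history counts, bigram counts, vocabulary).
    """
    padded = ["<s>"] + list(tokens)
    histories = Counter(padded[:-1])            # how often each word is a context
    bigrams = Counter(zip(padded, padded[1:]))  # how often each word pair occurs
    vocab = set(tokens)
    return histories, bigrams, vocab


def perplexity(segment, model):
    """Perplexity of a tokenised segment under an add-one smoothed bigram model."""
    histories, bigrams, vocab = model
    if not segment:
        raise ValueError("segment must contain at least one token")
    v = len(vocab) + 1  # one extra slot reserved for any unseen word
    padded = ["<s>"] + list(segment)
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-one (Laplace) smoothing keeps unseen bigrams from having zero probability.
        p = (bigrams[(prev, word)] + 1) / (histories[prev] + v)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(segment))


# Toy data, invented for illustration only.
author_text = "the cat sat on the mat the cat sat on the rug".split()
model = train_bigram_lm(author_text)

in_style = "the cat sat on the mat".split()
foreign = "quantum flux mat the on sat".split()

print(perplexity(in_style, model))
print(perplexity(foreign, model))
```

Under this sketch, a segment that resembles the author's writing yields a lower perplexity than one that does not, which is the intuition behind treating perplexity as a candidate feature. The same functions could be applied to part-of-speech or stemmed token sequences by changing only the input tokens.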
