Intrinsic plagiarism analysis

Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world in which a reference collection is given. This article investigates whether a computer program can detect plagiarism when no reference can be provided, e.g., when the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed “unmasking”, can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.
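The first three stages of the analysis chain named above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names, the three style features (average word length, function-word share, type-token ratio), the short function-word list, and the threshold are all hypothetical. A simple per-feature z-score against the document average stands in for a real one-class classifier, and the unmasking meta learning step is omitted.

```python
import re
from statistics import mean, pstdev

# Hypothetical, deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "for")


def chunk_words(text, size=50):
    """Document chunking: split the text into consecutive chunks of `size` words."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def style_vector(words):
    """Style model: three simple stylometric features for one chunk."""
    n = len(words)
    return [
        sum(len(w) for w in words) / n,              # average word length
        sum(w in FUNCTION_WORDS for w in words) / n,  # share of function words
        len(set(words)) / n,                          # type-token ratio
    ]


def outlier_scores(vectors):
    """Stand-in for one-class classification: mean absolute z-score of each
    chunk's features relative to the whole document."""
    columns = list(zip(*vectors))
    means = [mean(c) for c in columns]
    # A constant feature has zero deviation; use 1.0 to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [
        mean(abs(x - m) / s for x, m, s in zip(v, means, stds))
        for v in vectors
    ]


def suspicious_chunks(text, size=50, threshold=1.5):
    """Return the indices of chunks whose style deviates from the document average."""
    chunks = chunk_words(text, size)
    scores = outlier_scores([style_vector(c) for c in chunks])
    return [i for i, score in enumerate(scores) if score > threshold]
```

Calling `suspicious_chunks(document_text)` returns the positions of chunks whose style differs markedly from the rest of the document. In a fuller pipeline these candidates would be passed on to a post-processing step such as unmasking, since raw stylometric outlier scores are unreliable on their own.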
