A simple and efficient algorithm for authorship verification

This paper describes and evaluates Spatium‐L1, an unsupervised and effective authorship verification model. As features, we suggest using the 200 most frequent terms of the disputed text (isolated words and punctuation symbols). Applying a simple distance measure and a set of impostors, we can determine whether or not the disputed text was written by the proposed author. Moreover, a simple rule defines when there is enough evidence to propose an answer and when the attribution scheme cannot decide with a high degree of certainty. Evaluations on six test collections (PAN CLEF 2014 evaluation campaign) indicate that Spatium‐L1 usually ranks among the top three verification systems and, on an aggregate measure, achieves the best performance. The suggested strategy can be readily adapted to different Indo‐European languages (such as English, Dutch, Spanish, and Greek) and genres (essay, novel, review, and newspaper article).
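
The general idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the use of the mean impostor distance as a baseline, the `margin` parameter, and all function names are assumptions made for illustration. Only the 200 most frequent terms of the disputed text, the L1 distance, the use of impostors, and the option of withholding a decision come from the abstract.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase words and single punctuation symbols as tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def relative_freqs(tokens, vocab):
    # Relative frequency of each vocabulary term within the token list.
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[term] / total for term in vocab]

def l1_distance(query_tokens, profile_tokens, k=200):
    # The vocabulary is the k most frequent terms of the disputed text.
    vocab = [term for term, _ in Counter(query_tokens).most_common(k)]
    q = relative_freqs(query_tokens, vocab)
    p = relative_freqs(profile_tokens, vocab)
    return sum(abs(a - b) for a, b in zip(q, p))

def verify(disputed, author_texts, impostor_texts, k=200, margin=0.1):
    # Hypothetical decision rule: compare the distance to the proposed
    # author with the mean distance to the impostors, and abstain when
    # the ratio falls inside the band 1 +/- margin.
    query = tokenize(disputed)
    author = [tok for text in author_texts for tok in tokenize(text)]
    d_author = l1_distance(query, author, k)
    d_impostors = [l1_distance(query, tokenize(t), k) for t in impostor_texts]
    baseline = sum(d_impostors) / len(d_impostors)
    if baseline > 0:
        ratio = d_author / baseline
    else:
        ratio = float("inf") if d_author > 0 else 1.0
    if ratio < 1 - margin:
        return "same author"
    if ratio > 1 + margin:
        return "different author"
    return "undecided"
```

Calling `verify(disputed, author_texts, impostor_texts)` returns one of three labels. The "undecided" outcome mirrors the abstract's point that the scheme may decline to answer when the evidence is weak; the width of that band (`margin`) is a free parameter in this sketch.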
