Understanding and explaining Delta measures for authorship attribution

This article builds on a mathematical explanation of one of the most prominent stylometric measures, Burrows’s Delta (and its variants), to understand and explain how it works. Starting from the conceptual separation between feature selection, feature scaling, and distance measures, we designed a series of controlled experiments in which the kind of feature scaling (various types of standardization and normalization) and the type of distance measure (notably Manhattan, Euclidean, and Cosine) serve as independent variables, and the correct authorship attributions serve as the dependent variable indicating the performance of each proposed method. In this way, we are able to describe in some detail how these two variables interact and how they influence the results. We show that feature vector normalization, that is, the transformation of the feature vectors to a uniform length of 1 (implicit in the cosine measure), is the decisive factor in the recently proposed improvement of Delta. We also show that the information particularly relevant to identifying the author of a text lies in the profile of deviation across the most frequent words rather than in the extent of the deviation or in the deviation of specific words only.
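
The separation between feature scaling and distance measure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the relative frequencies are invented toy values, the function names are hypothetical, and the use of the population standard deviation for the z-scores is an assumption. It standardizes word frequencies, computes Manhattan, Euclidean, and cosine distances, and shows why normalizing vectors to length 1 is implicit in the cosine measure.

```python
import math
from statistics import mean, pstdev

# Toy relative frequencies (invented): rows are documents, columns are
# the most frequent words. Every column must vary across documents,
# otherwise its standard deviation is zero and z-scores are undefined.
FREQS = [
    [0.050, 0.030, 0.020, 0.010],
    [0.048, 0.033, 0.018, 0.012],
    [0.060, 0.025, 0.025, 0.008],
]

def z_scores(freqs):
    """Standardize each word (column) across all documents."""
    n_words = len(freqs[0])
    cols = [[doc[j] for doc in freqs] for j in range(n_words)]
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]  # assumption: population std dev
    return [[(doc[j] - mus[j]) / sigmas[j] for j in range(n_words)]
            for doc in freqs]

def length(v):
    """Euclidean length (L2 norm) of a vector."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale a vector to length 1."""
    n = length(v)
    return [x / n for x in v]

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (length(u) * length(v))

def burrows_delta(u, v):
    """Mean absolute difference of z-scores (Manhattan distance / n)."""
    return manhattan(u, v) / len(u)

Z = z_scores(FREQS)
u, v = Z[0], Z[1]
print("Burrows's Delta:", burrows_delta(u, v))
print("Euclidean:      ", euclidean(u, v))
print("Cosine distance:", cosine_distance(u, v))
# On vectors of length 1, squared Euclidean distance equals twice the
# cosine distance, so cosine ranks documents as normalized Euclidean does.
print("Check:", euclidean(normalize(u), normalize(v)) ** 2,
      2 * cosine_distance(u, v))
```

Because cosine distance ignores vector length, it compares only the profile of deviations across words and not their overall extent, which is the distinction the abstract draws.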
