Towards a better understanding of Burrows’s Delta in literary authorship attribution

Burrows’s Delta is the most established measure of stylometric difference in literary authorship attribution. Several improvements on the original Delta have been proposed; however, a recent empirical study showed that none of the proposed variants constitutes a major improvement in authorship attribution performance. In this paper, we aim to improve our understanding of how and why these text distance measures work for authorship attribution. We evaluate the effects of standardization and vector normalization on the statistical distributions of features and on the resulting text clustering quality. Furthermore, we explore supervised selection of discriminant words as a procedure for further improving authorship attribution.
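To make the two operations named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual formulation of Burrows's Delta as the mean absolute difference between z-scored relative frequencies of the most frequent words, and shows vector normalization as an optional extra step applied to the z-score vectors. The frequency values, the function names, and the choice of the sample standard deviation are illustrative assumptions.

```python
import math
from statistics import mean, stdev


def z_scores(freqs):
    """Standardize each word (column) of a texts-by-words frequency matrix.

    Assumes every word has non-zero spread across the texts; a constant
    column would give a zero standard deviation and a division error.
    """
    columns = list(zip(*freqs))
    mus = [mean(col) for col in columns]
    sigmas = [stdev(col) for col in columns]  # sample std: an assumption
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in freqs]


def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def burrows_delta(z_a, z_b):
    """Mean absolute difference between two z-score vectors."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))


def normalized_delta(z_a, z_b):
    """Same distance, computed after length-normalizing both vectors."""
    return burrows_delta(l2_normalize(z_a), l2_normalize(z_b))


# Invented relative frequencies of three frequent words in three texts.
# Rows are texts, columns are words; texts 0 and 1 are made deliberately similar.
toy_freqs = [
    [0.060, 0.030, 0.020],
    [0.058, 0.031, 0.022],
    [0.045, 0.040, 0.010],
]

z = z_scores(toy_freqs)
print("Delta(0, 1) =", burrows_delta(z[0], z[1]))
print("Delta(0, 2) =", burrows_delta(z[0], z[2]))
print("Normalized Delta(0, 2) =", normalized_delta(z[0], z[2]))
```

In this toy setup the two similar texts end up closer to each other than to the third. On real data the feature set would be the few hundred or few thousand most frequent words of the corpus rather than three hand-picked columns.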
