An open-set size-adjusted Bayesian classifier for authorship attribution

Recent studies of authorship attribution have used machine‐learning methods, including regularized multinomial logistic regression, neural nets, support vector machines, and the nearest shrunken centroid classifier, to identify likely authors of disputed texts. None of these methods can perform open‐set classification or account for text and corpus size. We propose a customized Bayesian logit‐normal‐beta‐binomial classification model for supervised authorship attribution. The model is based on the beta‐binomial distribution with an explicit inverse relationship between extra‐binomial variation and text size. It estimates this relationship internally and uses Markov chain Monte Carlo (MCMC) to produce distributions of posterior authorship probabilities instead of point estimates. We illustrate the method by training the machine‐learning methods and the open‐set Bayesian classifier on the undisputed papers of The Federalist, and by testing all of them on documents historically attributed to Alexander Hamilton, John Jay, and James Madison. Of the methods compared, the Bayesian classifier performed best on these texts.
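
To make the central idea concrete, the following is a minimal Python sketch of a beta‐binomial likelihood whose extra‐binomial variation shrinks as text size grows. It is an illustration under stated assumptions, not the model from the paper: the link `overdispersion(n) = c / (c + n)`, the constant `c`, the equal prior over authors, the treatment of word counts as independent, and all function names are hypothetical. It computes closed‐set point posteriors only; the logit‐normal layer, the open‐set mechanism, and the MCMC sampling described above are not shown.

```python
import math


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, mu, rho):
    """Log-pmf of a beta-binomial with mean proportion mu and
    overdispersion rho in (0, 1), where rho = 1 / (a + b + 1)."""
    s = (1.0 - rho) / rho              # s = a + b
    a, b = mu * s, (1.0 - mu) * s
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def overdispersion(n, c=50.0):
    """Hypothetical inverse link (an assumption, not the paper's form):
    extra-binomial variation decreases as text size n increases."""
    return c / (c + n)


def author_posteriors(counts, n, author_rates):
    """Closed-set posterior over candidate authors with equal priors.

    counts: observed count of each marker word in a text of n words.
    author_rates: dict mapping author -> list of per-word usage rates.
    Words are treated as independent, which is a simplification.
    """
    rho = overdispersion(n)
    loglik = {
        author: sum(betabinom_logpmf(k, n, mu, rho)
                    for k, mu in zip(counts, rates))
        for author, rates in author_rates.items()
    }
    # Normalize with log-sum-exp for numerical stability.
    top = max(loglik.values())
    total = sum(math.exp(v - top) for v in loglik.values())
    return {a: math.exp(v - top) / total for a, v in loglik.items()}


# Illustrative, made-up rates for two marker words and two authors.
rates = {"A": [0.03, 0.005], "B": [0.01, 0.02]}
posterior = author_posteriors(counts=[60, 10], n=2000, author_rates=rates)
```

Here `posterior` maps each candidate author to a probability, and the probabilities sum to one. Because `overdispersion` decreases in `n`, the same observed word proportions yield sharper posteriors for long texts than for short ones, which is the size adjustment the abstract refers to.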
