Estimating the probability of an authorship attribution

In authorship attribution, various distance‐based metrics have been proposed to determine the most probable author of a disputed text. In this paradigm, a distance is computed between each author profile and the query text, and these values are then used only to rank the possible authors. In this article, we analyze the distribution of these distances and show that it can be modeled as a mixture of two Beta distributions. Based on this finding, we demonstrate how to derive a more accurate probability that the closest author is, in fact, the real author. To evaluate this approach, we have chosen four authorship attribution methods (Burrows' Delta, Kullback‐Leibler divergence, Labbé's intertextual distance, and naïve Bayes). As the first test collection, we have downloaded 224 State of the Union addresses (from 1790 to 2014) delivered by 41 U.S. presidents. The second test collection consists of the Federalist Papers. The evaluations indicate that the accuracy rate of some authorship decisions can be improved. The suggested method can also signal when a proposed assignment should be interpreted as merely possible, without strong certainty. Being able to quantify the certainty associated with an authorship decision can be a useful component when important decisions must be made.
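As a rough illustration of how a two-component Beta mixture could yield such a probability, the following Python sketch applies Bayes' rule to a single distance value. It is not the procedure of the article: it assumes that distances have been rescaled to the open interval (0, 1), that one component describes distances to the correct author and the other describes distances to incorrect authors, and that the shape parameters and mixing weight (all hypothetical values here) have already been estimated from training data.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def prob_correct(d, w_correct, correct_params, incorrect_params):
    """Posterior probability that a closest-author assignment is correct.

    d                -- distance to the closest author, rescaled to (0, 1)
    w_correct        -- mixing weight of the "correct" component (assumed known)
    correct_params   -- (a, b) of the Beta component for correct assignments
    incorrect_params -- (a, b) of the Beta component for incorrect assignments
    """
    p_c = w_correct * beta_pdf(d, *correct_params)
    p_i = (1.0 - w_correct) * beta_pdf(d, *incorrect_params)
    return p_c / (p_c + p_i)


# Hypothetical parameters, chosen only for illustration.
W_CORRECT = 0.7
CORRECT = (2.0, 8.0)    # mass concentrated on small distances
INCORRECT = (6.0, 4.0)  # mass concentrated on larger distances

for d in (0.1, 0.3, 0.6):
    p = prob_correct(d, W_CORRECT, CORRECT, INCORRECT)
    print(f"distance={d:.1f}  P(correct)={p:.3f}")
```

Under these assumed parameters, a small distance gives a probability close to 1, while a larger distance gives a low probability, which is the kind of signal that would mark an assignment as possible but uncertain.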
