Authorship Attribution: A Comparative Study of Three Text Corpora and Three Languages

Abstract The first objective of this paper is to carry out three experiments that evaluate authorship attribution methods on three test collections available in three different languages (English, French, and German). In the first, we represent and categorize 52 text excerpts written by nine authors and taken from 19th-century English novels. In the second, we work with 44 segments from French novels written by eleven authors, mostly from the 19th century. In the third, we extract 59 German text excerpts from novels written by 15 authors and published mainly during the 19th and early 20th centuries. The second objective is to analyse the performance differences obtained when using word types or lemmas as text representations. The third objective is to evaluate three authorship attribution schemes: the first uses principal component analysis (PCA), the second applies the Delta approach, and the third is a new authorship attribution method based on specific vocabulary. This concept is computed for a given text (or author profile) and then compared with the entire corpus. From this information we show how a distance measure can be derived and, by means of the nearest neighbor approach, we suggest a simple and efficient authorship attribution scheme. Based on the three test collections, and using either word types or lemmas as features, we demonstrate that the suggested classification scheme performs better than the PCA method and slightly better than the Delta approach.
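
As an illustration of the Delta approach combined with a nearest neighbor decision, the following is a minimal sketch. It assumes the commonly cited formulation of Burrows's Delta (the mean absolute difference between z-scored relative frequencies of the most frequent word types), which may differ from the exact configuration evaluated in the paper. The function names, the `profiles` input format, and the `n_features` default are hypothetical choices made for this example. The specific-vocabulary method is not sketched, because the abstract does not give its definition.

```python
from collections import Counter
import math


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / n for w in vocab]


def delta_attribution(profiles, query_tokens, n_features=50):
    """Attribute query_tokens to the nearest author profile by Delta.

    profiles: dict mapping author name -> list of tokens (word types or
    lemmas). At least two profiles are needed to estimate the deviations.
    Returns (best_author, {author: delta}).
    """
    # Features: the most frequent words over all author profiles (assumption).
    total = Counter()
    for tokens in profiles.values():
        total.update(tokens)
    vocab = [w for w, _ in total.most_common(n_features)]

    authors = sorted(profiles)
    freqs = {a: relative_freqs(profiles[a], vocab) for a in authors}
    query = relative_freqs(query_tokens, vocab)

    # Mean and sample standard deviation of each feature across profiles.
    kept = []  # (feature index, mean, std) for features with non-zero std
    for j in range(len(vocab)):
        column = [freqs[a][j] for a in authors]
        mean = sum(column) / len(column)
        var = sum((x - mean) ** 2 for x in column) / (len(column) - 1)
        if var > 0:
            kept.append((j, mean, math.sqrt(var)))

    # Delta: mean absolute difference of z-scores over the kept features.
    distances = {}
    for a in authors:
        diffs = [
            abs((query[j] - mean) / std - (freqs[a][j] - mean) / std)
            for j, mean, std in kept
        ]
        distances[a] = sum(diffs) / len(diffs) if diffs else float("inf")

    # Nearest neighbor rule: the author with the smallest distance wins.
    best_author = min(distances, key=distances.get)
    return best_author, distances
```

Passing lemmas instead of word types only changes the token lists handed to `delta_attribution`; the distance computation stays the same.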
