Language chunking, data sparseness, and the value of a long marker list: explorations with word n-grams and authorial attribution

The frequencies of individual words have been the mainstay of computer-assisted authorial attribution over the past three decades. The usefulness of this sort of data is attested in many benchmark trials and in numerous studies of particular authorship problems. It is sometimes argued, however, that n-grams with n higher than 1 are superior to individual words as a source of authorship markers, since language as spoken or written falls into word sequences, on the ‘idiom principle’, and since language is characteristically produced in the brain in chunks, not in individual words. In this article, we test the usefulness of word n-grams for authorship attribution by asking how many good-quality authorship markers are yielded by n-grams of various types, namely 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams. We use two ways of formulating the n-grams, two corpora of texts, and two methods for finding and assessing markers. We find that when using methods based on regularly occurring markers, and drawing on all the available vocabulary, 1-grams perform best. With methods based on rare markers, and all the available vocabulary, strict 3-gram sequences perform best. If we restrict ourselves to a defined word list of function words to form n-grams, 2-grams offer a striking improvement on 1-grams.
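As an illustration of the kind of data at issue, the following Python sketch counts word n-grams for n from 1 to 5. It is not the procedure used in the article: the abstract does not spell out its two n-gram formulations, so the sketch assumes that a "strict" n-gram is a run of n adjacent tokens, and that n-grams from a defined word list are formed by first discarding every token not on the list. The tokenizer, the tiny `FUNCTION_WORDS` set, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase the text and treat runs of
    # letters and apostrophes as words.
    return re.findall(r"[a-z']+", text.lower())


def word_ngrams(tokens, n):
    # Strict n-grams (assumed meaning): every run of n adjacent tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n, word_list=None):
    # Count n-grams in a text. If word_list is given, tokens outside the
    # list are dropped before n-grams are formed (an assumed reading of
    # "a defined word list of function words").
    tokens = tokenize(text)
    if word_list is not None:
        tokens = [t for t in tokens if t in word_list]
    return Counter(word_ngrams(tokens, n))


# Hypothetical, deliberately tiny function-word list for illustration only.
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a"}

sample = "The cat sat on the mat and the dog sat on the rug."

if __name__ == "__main__":
    for n in range(1, 6):
        counts = ngram_counts(sample, n)
        # Number of distinct n-gram types and the most frequent one.
        print(n, len(counts), counts.most_common(1))
    print(ngram_counts(sample, 2, FUNCTION_WORDS))
```

Running the loop on a longer text would typically show the number of distinct n-gram types growing and the individual counts shrinking as n increases, which is the data-sparseness trade-off named in the title.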
