Authorship Attribution of E-Mail: Comparing Classifiers over a New Corpus for Evaluation

The release of the Enron corpus provided a unique resource for studying aspects of email use: because it is largely unfiltered, it presents a relatively complete collection of emails for a reasonably large number of correspondents. This paper describes a newly created subcorpus of the Enron emails that we suggest can be used to test techniques for authorship attribution, and applies three different classification methods to this task to provide baseline results. Two of the classifiers are standard and have been shown to perform well in the literature; the third is novel and is based on concurrent work that proposes a Bayesian hierarchical distribution for word counts in documents. For each classifier, we present results using six text representations, which include linguistic structures derived from a parser as well as lexical information.
