Mining e-mail content for author identification forensics

We describe an investigation into e-mail content mining for author identification (authorship attribution) in support of forensic investigation. We focus on the ability to discriminate between authors both when e-mail topics are aggregated and across different e-mail topics. An extended set of e-mail document features, including structural characteristics and linguistic patterns, was derived and used together with a Support Vector Machine learning algorithm to mine the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
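
As a rough illustration of the feature-derivation step mentioned above, the following sketch maps one e-mail body to a few structural and linguistic style features. It is not the feature set used in this work: the function name `extract_features`, the short `FUNCTION_WORDS` list, and every individual feature are assumptions chosen only to show the general shape of such a representation. The Support Vector Machine stage is not shown.

```python
import re
from collections import Counter

# Hypothetical, deliberately short function-word list (an assumption,
# not the list used in the paper).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "for")


def extract_features(body: str) -> dict:
    """Map one e-mail body to a small dictionary of illustrative style features."""
    lines = body.splitlines()
    words = re.findall(r"[A-Za-z']+", body.lower())
    n_lines = max(len(lines), 1)   # guard against division by zero
    n_words = max(len(words), 1)
    n_chars = max(len(body), 1)
    counts = Counter(words)

    features = {
        # Structural characteristics: how the message is laid out.
        "num_lines": len(lines),
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        "has_greeting": float(
            bool(re.match(r"\s*(hi|hello|dear)\b", body, re.IGNORECASE))
        ),
        # Linguistic patterns: how the author uses words and characters.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocab_richness": len(counts) / n_words,
        "uppercase_ratio": sum(c.isupper() for c in body) / n_chars,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = counts[fw] / n_words
    return features


if __name__ == "__main__":
    sample = "Hi Bob,\n\nThe report is in the folder.\n\nRegards"
    for name, value in extract_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Each e-mail then becomes a fixed-length numeric vector (the dictionary values in a fixed key order), which is the kind of input a Support Vector Machine classifier expects, with one class label per candidate author.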
