Author Identification of E-mail Messages with OLMAM Trained Feedforward Neural Networks

The OLMAM algorithm (optimized Levenberg-Marquardt with adaptive momentum) is a variant of the Levenberg-Marquardt algorithm for training multilayer feedforward neural networks. OLMAM has been shown to obtain excellent solutions in difficult classification problems where other computational intelligence techniques usually perform worse. In this paper we apply OLMAM to author identification of e-mail messages, a challenging classification problem because of the special characteristics of the data. We performed a number of experiments with a corpus of real-world e-mail messages (the Enron corpus) and compared the performance of the proposed method with that of Naive Bayes and SVM classifiers. Author identification with OLMAM was found to be significantly better than with the other methods, even when an author wrote about different topics.
