Gender-preferential text mining of e-mail discourse

This paper describes an investigation of author gender attribution from e-mail text documents. We used an extended set of predominantly topic-independent (content-free) e-mail features, including style markers, structural characteristics and gender-preferential language features, together with a support vector machine learning algorithm. Experiments on a corpus of e-mail documents written by a large number of authors of both genders gave promising results for author gender categorisation.
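As a rough illustration of what topic-independent features of this kind might look like, the following minimal sketch turns one e-mail body into a small feature dictionary. It is not the feature set used in the paper: the function name `extract_features`, the word lists and the specific ratios are all assumptions chosen for illustration.

```python
import re
import string

# Hypothetical word lists for illustration only; the paper's actual
# style markers and gender-preferential features are not listed here.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "i", "it", "you")
MARKER_WORDS = ("sorry", "really", "very", "quite", "so")


def extract_features(email_text: str) -> dict:
    """Map an e-mail body to topic-independent numeric features (a sketch)."""
    lines = email_text.splitlines()
    words = re.findall(r"[A-Za-z']+", email_text.lower())
    n_words = max(len(words), 1)       # avoid division by zero on empty input
    n_chars = max(len(email_text), 1)
    n_lines = max(len(lines), 1)

    features = {
        # Style markers: computed from form, not from topic vocabulary.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "punctuation_ratio": sum(ch in string.punctuation for ch in email_text) / n_chars,
        # Structural characteristics of the message layout.
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        "has_greeting": float(bool(re.match(r"\s*(hi|hello|dear)\b", email_text, re.I))),
        # Assumed stand-in for a gender-preferential language feature.
        "marker_word_ratio": sum(words.count(w) for w in MARKER_WORDS) / n_words,
    }
    # Relative frequency of each function word.
    for w in FUNCTION_WORDS:
        features[f"fw_{w}"] = words.count(w) / n_words
    return features


if __name__ == "__main__":
    sample = "Hi Sam,\n\nI am so sorry, really.\n"
    for name, value in extract_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature vectors built this way, one per e-mail and labelled with the author's gender, could then be passed to any support vector machine implementation for training and evaluation; that step is left out here because the abstract does not specify the kernel or parameters.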
