Gender identification from E-mails

In this paper, we investigate gender identification for short, multi-genre, content-free e-mails. To our knowledge, we are the first to introduce psycholinguistic and gender-linked cues for this problem, alongside traditional stylometric features. Decision tree and Support Vector Machine learning algorithms are used to identify the gender of the author of a given e-mail. The experimental results show that our approach is promising, with an average accuracy of 82.2%.
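To make the idea of stylometric features concrete, the following is a minimal sketch of how an e-mail body might be turned into a numeric feature vector. It is an illustration only: the function name `stylometric_features`, the `FUNCTION_WORDS` list, and the particular measures chosen are assumptions made for this example and are not the feature set used in the paper, which also includes psycholinguistic and gender-linked cues not shown here.

```python
import re
from collections import Counter

# Hypothetical, illustrative list of function words; NOT the paper's feature set.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "i", "it", "for")


def stylometric_features(text: str) -> dict:
    """Compute a few simple style measures for one e-mail body (assumed features)."""
    # Lowercased word tokens made of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Rough sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    counts = Counter(words)

    features = {
        "n_words": n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_len": n_words / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(counts) / n_words if n_words else 0.0,
        "exclamation_rate": text.count("!") / len(text) if text else 0.0,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = counts[fw] / n_words if n_words else 0.0
    return features


if __name__ == "__main__":
    sample = "I am so sorry! I will send it to the team."
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value}")
```

Vectors of this kind, one per e-mail, could then be passed to a decision tree or Support Vector Machine classifier such as those named in the abstract; the classifier step is omitted here to keep the sketch dependency-free.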
