Author gender identification from text

Text is still the most prevalent Internet media type. Popular social networking applications such as Twitter, Craigslist and Facebook rely on it, and other web applications such as e-mail, blogs and chat rooms are also mostly text-based. In this paper we address the following question in text-based Internet forensics: given a short text document, can we identify whether the author is a man or a woman? The question is motivated by recent events in which people faked their gender on the Internet. Note that this differs from the authorship attribution problem. We investigate author gender identification for short, multi-genre, content-free text of the kind found in many Internet applications. The fundamental questions we ask are: do men and women inherently use different classes of language styles? If so, which linguistic features are good indicators of gender? Based on research in human psychology, we propose 545 psycho-linguistic and gender-preferential cues, along with stylometric features, to build the feature space for this identification problem. Identifying the correct set of features that indicate gender remains an open research problem. Three machine learning algorithms (support vector machine, Bayesian logistic regression and AdaBoost decision tree) are then designed for gender identification based on the proposed features. Extensive experiments on large text corpora (Reuters Corpus Volume 1 newsgroup data and Enron e-mail data) indicate an accuracy of up to 85.1% in identifying the author's gender. The experiments also indicate that function words, word-based features and structural features are significant gender discriminators.
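
As a rough illustration of what function-word, word-based and structural features can look like, the following minimal Python sketch maps a short text to a small feature vector. It is not the paper's feature set: the ten-word `FUNCTION_WORDS` list, the function name `extract_features`, and the specific features chosen are all assumptions made for this example, and the 545 proposed features are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration only; the paper's actual feature set is far larger).
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "i", "you", "not", "but")


def extract_features(text: str) -> dict:
    """Map a short text to a few simple stylometric features."""
    # Lowercased word tokens (letters and apostrophes only).
    words = re.findall(r"[a-z']+", text.lower())
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)

    # Function-word features: relative frequency of each listed word.
    features = {f"fw_{w}": counts[w] / n_words for w in FUNCTION_WORDS}
    # Word-based features: average word length and vocabulary richness.
    features["avg_word_len"] = sum(len(w) for w in words) / n_words
    features["type_token_ratio"] = len(counts) / n_words
    # Structural feature: average number of words per sentence.
    features["avg_sentence_len"] = len(words) / (len(sentences) or 1)
    return features


if __name__ == "__main__":
    sample = "I like the cat. You like the dog!"
    for name, value in extract_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Vectors of this kind could then be passed to any standard classifier. Only frequencies and counts are used, so the features do not depend on the topic of the message.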
