Automatically Categorizing Written Texts by Author Gender

Automatically determining the gender of a document's author would appear to be a subtler problem than categorization by topic or authorship attribution. Nevertheless, it is shown that automated text categorization techniques can exploit combinations of simple lexical and syntactic features to infer the gender of the author of an unseen formal written document with approximately 80 per cent accuracy. The same techniques can be used to determine whether a document is fiction or non-fiction with approximately 98 per cent accuracy.
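
As a rough illustration of what categorization from simple lexical features can look like, the sketch below turns a text into relative frequencies of a few common words and trains a small mistake-driven linear classifier on those vectors. Everything in it is an assumption made for illustration: the word list `FUNCTION_WORDS`, the function names, and the choice of a plain perceptron are hypothetical, and the abstract does not state which features or learning algorithm were actually used.

```python
import re
from collections import Counter

# Hypothetical word list for illustration only; the abstract does not
# specify which lexical or syntactic features were used.
FUNCTION_WORDS = ["the", "a", "of", "she", "he", "with", "not", "for"]


def extract_features(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in vocabulary]


def train_perceptron(samples, labels, epochs=20):
    """Train a plain perceptron (an assumed stand-in learner).

    samples: list of equal-length feature vectors
    labels:  list of +1 / -1 class labels
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # update only when the example is misclassified
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 for a single feature vector."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1
```

In use, each training document would be passed through `extract_features`, the resulting vectors and their +1/-1 class labels given to `train_perceptron`, and an unseen document classified with `predict`. The labels could equally encode author gender or the fiction versus non-fiction distinction; the sketch makes no claim about reproducing the reported accuracies.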
