Discriminating Gender on Twitter

Accurate prediction of demographic attributes from social media and other informal online content is valuable for marketing, personalization, and legal investigation. This paper describes the construction of a large, multilingual dataset labeled with gender, and investigates statistical models for determining the gender of uncharacterized Twitter users. We explore several different classifier types on this dataset. We show how classifier accuracy varies with tweet volume and with the inclusion of various kinds of profile metadata in the models. We also perform a large-scale human assessment using Amazon Mechanical Turk. Our methods significantly outperform both baseline models and almost all humans on the same task.
