Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment

There is growing interest in automatically predicting the gender and age of authors from their texts. However, most research so far ignores that language use is related to the social identity of speakers, which may differ from their biological identity. In this paper, we combine insights from sociolinguistics with data collected through an online game to underline the importance of approaching age and gender as social variables rather than static biological variables. In our game, thousands of players guessed the gender and age of Twitter users based on tweets alone. We show that more than 10% of the Twitter users do not employ language that the crowd associates with their biological sex. We also show that older Twitter users are often perceived to be younger than they are. Our findings highlight the limitations of current approaches to gender and age prediction from texts.
