Language-independent Gender Prediction on Twitter

In this paper, we present a set of experiments and analyses on predicting the gender of Twitter users from language-independent features extracted from either the text or the metadata of users’ tweets. We perform our experiments on the TwiSty dataset, which contains manual gender annotations for users speaking six different languages. Our classification results show that the prediction model based on language-independent features performs worse than the bag-of-words model when training and testing on the same language, but regularly outperforms it when applied to different languages, with very stable results across languages. Finally, we perform a comparative analysis of feature effect sizes across the six languages and show that differences in our features correspond to cultural distances.
