Discriminating gender on Chinese microblog: A study of online behaviour, writing style and preferred vocabulary

User attributes are useful for applications such as personalized recommendation and advertising, so user attribute prediction on Twitter has attracted intensive attention in recent years. Although Chinese micro-blogging services differ from Twitter in aspects such as language and user behaviour, little work has addressed them. In this paper, we propose a gender prediction model for Chinese microblogs that exploits three groups of features: online behaviour, writing style, and preferred vocabulary. Experimental results on Sina Weibo, one of the most popular micro-blogging services in China, show that our model achieves state-of-the-art accuracy of 94.3%. We also find significant differences between male and female microblog users in online behaviour, writing style, and preferred vocabulary, which should help improve personalized applications.
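
As an illustration only, the three feature groups named above could be merged into a single sparse feature vector per user before being passed to a classifier. The sketch below is not the paper's implementation: the concrete features (repost ratio, average post length, exclamation rate, token counts), the `beh:`/`sty:`/`voc:` prefixes, the input record layout, and all function names are assumptions chosen for the example. It also assumes each post has already been word-segmented, since Chinese text has no whitespace word boundaries.

```python
from collections import Counter


def behaviour_features(posts):
    # Hypothetical behaviour signals: posting volume and share of reposts.
    n = len(posts)
    reposts = sum(1 for p in posts if p["is_repost"])
    return {
        "beh:num_posts": float(n),
        "beh:repost_ratio": reposts / n if n else 0.0,
    }


def style_features(posts):
    # Hypothetical style signals: average post length in characters and
    # exclamation marks (ASCII or full-width) per character.
    texts = [p["text"] for p in posts]
    total_chars = sum(len(t) for t in texts)
    exclaims = sum(t.count("!") + t.count("！") for t in texts)
    return {
        "sty:avg_length": total_chars / len(texts) if texts else 0.0,
        "sty:exclaim_rate": exclaims / total_chars if total_chars else 0.0,
    }


def vocabulary_features(posts):
    # Bag-of-words counts over pre-segmented tokens (segmentation is assumed
    # to have been done upstream by some Chinese word segmenter).
    counts = Counter()
    for p in posts:
        counts.update(p["tokens"])
    return {f"voc:{word}": float(c) for word, c in counts.items()}


def extract_features(posts):
    # Merge the three groups; the prefixes keep feature names from colliding.
    features = {}
    features.update(behaviour_features(posts))
    features.update(style_features(posts))
    features.update(vocabulary_features(posts))
    return features


# Made-up example input for one user (not data from the paper).
example_posts = [
    {"text": "今天天气真好！", "tokens": ["今天", "天气", "真", "好"], "is_repost": False},
    {"text": "哈哈", "tokens": ["哈哈"], "is_repost": True},
]
features = extract_features(example_posts)
for name in sorted(features):
    print(name, features[name])
```

A dictionary of named features like this can be vectorized and fed to any linear classifier. Keeping the groups in separate functions also makes it straightforward to train on one group at a time and compare how much each contributes.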
