Semi-supervised Gender Classification with Joint Textual and Social Modeling

In gender classification, labeled data is often scarce while unlabeled data is plentiful. This motivates semi-supervised learning, which aims to improve classification performance by exploiting the knowledge in both labeled and unlabeled data. In this paper, we propose a semi-supervised approach to gender classification that leverages textual features together with a specific kind of indirect link among users, which we call “same-interest” links. Specifically, we propose a factor graph, the Textual and Social Factor Graph (TSFG), to model both the textual information and the “same-interest” link information. Empirical studies demonstrate the effectiveness of the proposed approach to semi-supervised gender classification.
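To make the idea of combining per-user textual evidence with link evidence concrete, the following is a minimal sketch of a pairwise factor graph over users. It is not the TSFG model itself, whose factor definitions and learning procedure are not given in this abstract. Everything here is an assumption for illustration: the function name `loopy_bp`, the `coupling` strength, the use of sum-product loopy belief propagation for inference, and the choice to represent a labeled user by a one-hot unary potential and an unlabeled user by a text classifier's class probabilities.

```python
def loopy_bp(unary, edges, coupling=2.0, iters=20):
    """Approximate per-user label marginals in a binary pairwise factor graph.

    unary:    list of [p0, p1] potentials per user (hypothetical text scores;
              a labeled user is clamped with [1.0, 0.0] or [0.0, 1.0]).
    edges:    list of (i, j) pairs standing in for "same-interest" links.
    coupling: assumed strength (> 1) with which a link favors equal labels.
    """
    n = len(unary)
    # Link factor: higher value when both endpoints share a label.
    psi = [[coupling, 1.0], [1.0, coupling]]
    nbrs = {i: [] for i in range(n)}
    msgs = {}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
        msgs[(i, j)] = [0.5, 0.5]
        msgs[(j, i)] = [0.5, 0.5]

    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            # Combine i's text evidence with messages from all neighbors but j.
            prod = list(unary[i])
            for k in nbrs[i]:
                if k != j:
                    prod = [prod[x] * msgs[(k, i)][x] for x in range(2)]
            # Sum out i's label through the link factor.
            m = [sum(prod[xi] * psi[xi][xj] for xi in range(2)) for xj in range(2)]
            z = sum(m)
            new[(i, j)] = [v / z for v in m]
        msgs = new

    beliefs = []
    for i in range(n):
        b = list(unary[i])
        for k in nbrs[i]:
            b = [b[x] * msgs[(k, i)][x] for x in range(2)]
        z = sum(b)
        beliefs.append([v / z for v in b])
    return beliefs


# User 0 is labeled (class 0); users 1 and 2 are unlabeled.
# User 1 has uninformative text but shares an interest with user 0.
beliefs = loopy_bp(
    unary=[[1.0, 0.0], [0.5, 0.5], [0.4, 0.6]],
    edges=[(0, 1)],
)
```

In this toy setting the unlabeled user 1, whose text is uninformative, is pulled toward the label of the linked labeled user 0, while the unlinked user 2 keeps its text-only estimate. This is the kind of effect a joint textual and social model is meant to capture.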
