Gender Prediction Using Browsing History

Demographic attributes of Internet users, such as gender and age, provide important information for marketing, personalization, and user behavior research. This paper addresses the problem of predicting users’ gender from their browsing history. We take a classification-based approach and investigate a number of features derived from browsing log data. We show that high-level content features such as topics or categories are highly predictive of gender, and that combining them with features derived from access times and browsing patterns leads to significant improvements in prediction accuracy. We empirically verify the effectiveness of the method on real datasets from Vietnamese online media. The method substantially outperforms a baseline and achieves a macro-averaged F1 score of 0.805. The experimental results also demonstrate the value of combining different feature types: the combined feature set achieves a 12% improvement in F1 score over the best-performing individual feature type.
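For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how a macro-averaged F1 score is commonly computed: F1 is calculated separately for each class and the per-class scores are averaged with equal weight. The function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper's experiments.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        # Per-class counts of true positives, false positives, false negatives.
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only (not data from the paper).
y_true = ["female", "female", "male", "male"]
y_pred = ["female", "male", "male", "male"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because each class contributes equally to the average, the score is not dominated by the majority class when the gender distribution in the data is unbalanced.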
