Inferring the demographics of search users: social data meets search queries

Knowing users' views and demographic traits offers great potential for personalizing web search results and related services such as query suggestion and query completion. Such signals, however, are often available for only a small fraction of search users, namely those who log in with their social network account and allow its use for personalizing search results. In this paper, we offer a solution to this problem by showing how users' demographic traits such as age and gender, and even their political and religious views, can be efficiently and accurately inferred from their search query histories. First, we train predictive models on the publicly available myPersonality dataset, which contains users' Facebook Likes and their demographic information. Second, we match Facebook Likes with search queries using Open Directory Project categories. Finally, we apply the models trained on Facebook Likes to large-scale query logs of a commercial search engine while explicitly taking into account the difference between the trait distributions of the two datasets. We find that the accuracy of classifying age and gender, expressed as the area under the ROC curve (AUC), is 77% and 84%, respectively, for predictions based on Facebook Likes, and degrades only to 74% and 80% for predictions based on search queries. On a US state-by-state basis, we find a Pearson correlation of 0.72 between the predicted scores for political views and Gallup data, and of 0.54 between the predicted scores for affiliation with Judaism and data from the US Religious Landscape Survey. We conclude that it is indeed feasible to infer important demographic data of users from their query histories based on labelled Likes data, and we believe that this approach could provide valuable information for personalization and monetization even in the absence of demographic data.
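
The abstract does not say how the difference in trait distributions between the two datasets is taken into account. As an illustration only, the sketch below assumes a standard prior-shift correction for a binary trait: a posterior estimated under the training (Likes) class prior is re-weighted by the ratio of the target (query log) prior to the training prior and then renormalized. The function name, parameter names, and example priors are hypothetical and are not taken from the paper.

```python
def adjust_for_prior_shift(p_source, source_prior, target_prior):
    """Re-weight a binary posterior p(y=1|x) for a different class prior.

    p_source:     posterior from a model trained where P(y=1) = source_prior
    source_prior: share of the positive class in the training data (assumed known)
    target_prior: share of the positive class in the target population (assumed known)

    Assumption: only the class prior differs between the two datasets;
    the class-conditional feature distributions are taken to be the same.
    """
    for name, value in (("source_prior", source_prior), ("target_prior", target_prior)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")
    if not 0.0 <= p_source <= 1.0:
        raise ValueError("p_source must be a probability")

    # Scale each class's posterior by the ratio of target prior to source prior.
    pos = p_source * target_prior / source_prior
    neg = (1.0 - p_source) * (1.0 - target_prior) / (1.0 - source_prior)

    # Renormalize so the two class posteriors sum to one again.
    return pos / (pos + neg)


if __name__ == "__main__":
    # Hypothetical numbers: the trait is balanced in the training data
    # but held by 80% of the target population.
    print(adjust_for_prior_shift(0.5, source_prior=0.5, target_prior=0.8))
```

When the two priors are equal the posterior is returned unchanged; when the trait is more common in the target population than in the training data, every score moves upward, and downward in the opposite case.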
