User demographics prediction based on mobile data

Demographics prediction is an important component of user profile modeling. Accurate prediction of users' demographics can benefit many applications, ranging from web search and personalization to behavior targeting. In this paper, we focus on how to predict users' demographics, including ''gender'', ''job type'', ''marital status'', ''age'' and ''number of family members'', based on mobile data such as users' usage logs, physical activities and environmental contexts. The core idea is to build a supervised learning framework in which each user is represented as a feature vector and the users' demographics are the prediction targets. The most important step is constructing features from the raw data, after which supervised learning models can be applied. We propose a feature construction framework, CFC (contextual feature construction), in which each feature is defined as the conditional probability of one user activity under a given context. In addition to employing standard supervised learning models, we propose a regularized multi-task learning framework to model the different demographics prediction tasks collectively. We also propose a cost-sensitive classification framework for the regression tasks, in order to benefit from existing dimension reduction methods. Finally, because training instances are limited, we employ ensemble methods to avoid overfitting. The experimental results show that, under leave-one-out evaluation, the framework achieves classification accuracies on ''gender'', ''job'' and ''marital status'' as high as 96%, 83% and 86%, respectively, and Root Mean Square Error (RMSE) on ''age'' and ''number of family members'' as low as 0.69 and 0.66, respectively.
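The CFC idea of defining each feature as the conditional probability of an activity under a given context can be illustrated with a minimal sketch. The event format, the function name `contextual_features`, and the example activities and contexts below are assumptions made for illustration only; the abstract does not specify how the framework encodes contexts or handles contexts that were never observed (this sketch simply assigns them 0.0).

```python
from collections import Counter


def contextual_features(events, activities, contexts):
    """Build one user's feature vector of estimated P(activity | context).

    events: list of (context, activity) pairs logged for a single user
            (a hypothetical format, not taken from the paper).
    Returns a flat list ordered by context, then by activity.
    """
    context_counts = Counter(context for context, _ in events)
    pair_counts = Counter(events)

    features = []
    for context in contexts:
        total = context_counts[context]
        for activity in activities:
            # Relative frequency of the activity within this context;
            # contexts never observed for this user get 0.0 (an assumption).
            features.append(pair_counts[(context, activity)] / total if total else 0.0)
    return features


# Hypothetical usage log for one user.
events = [
    ("morning", "call"),
    ("morning", "call"),
    ("morning", "sms"),
    ("evening", "sms"),
]
vec = contextual_features(events, ["call", "sms"], ["morning", "evening"])
```

Here `vec` holds four values, one per (context, activity) pair; the entries for each observed context sum to 1. A vector of this kind, built per user, could then be passed to any standard supervised learner.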
