Collective Semi-Supervised Learning for User Profiling in Social Media

The abundance of user-generated data in social media has spurred the development of methods to infer users' latent attributes, which are useful for personalization, advertising, and recommendation. However, current user profiling approaches have had limited success, for two reasons: they lack a principled way to integrate the different types of social relationships a user has, and they rely on scarce labeled data to build a prediction model. In this paper, we present a novel solution termed Collective Semi-Supervised Learning (CSL), which provides a principled means to integrate different types of social relationships and unlabeled data under a unified computational framework. Learning jointly from multiple relationships and unlabeled data yields a computationally sound and accurate approach to modeling user attributes in social media. Extensive experiments on Twitter data demonstrate the efficacy of CSL in inferring user attributes such as account type and marital status. We also show how CSL can be used to identify important user features and to make inferences about a larger user population.
