SLR: A scalable latent role model for attribute completion and tie prediction in social networks

Social networks are an important class of networks that span a wide variety of media, including social websites such as Facebook and Google Plus, citation networks of academic papers and patents, caller networks in telecommunications, and hyperlinked document collections such as Wikipedia. Many of these networks now exceed millions of users or actors, each of which may be associated with rich attribute data, such as user profiles in social websites and caller networks, or subject classifications in document collections and citation networks. Such attribute data is often incomplete: users may be unwilling to spend the effort to complete their profiles, and in document collections there may be insufficient human labor to classify all documents accurately. The tie or link information in these networks may be incomplete as well: in social websites, users may simply be unaware of potential acquaintances, while in citation networks, authors may be unaware of relevant literature that should be referenced. Completing and predicting these missing attributes and ties is important to a spectrum of applications, such as recommendation, personalized search, and targeted advertising, yet large social networks pose a scalability challenge to existing algorithms designed for this task. To this end, we propose an integrative probabilistic model, SLR, that captures attribute and tie information simultaneously and can be used for both attribute completion and tie prediction, thereby enabling the above-mentioned applications. A key innovation in our model is the use of triangle motifs to represent ties in the network, which allows it to scale to networks with millions of nodes and beyond.
Experiments on real-world datasets show that SLR significantly improves the accuracy of attribute prediction and tie prediction compared to well-known methods, and that our distributed, multi-machine implementation scales easily to millions of users. In addition to fast and accurate attribute and tie prediction, we demonstrate how SLR can identify the attributes most responsible for homophily within the network, thus revealing which attributes drive tie formation.
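
To make the idea of representing ties by triangle motifs more concrete, the following is a minimal sketch, not the SLR implementation. It assumes an undirected network given as an edge list and assumes that a "motif" is any node triple centered on a node with two of its neighbors, either closed (all three ties present) or open (only two ties present). The function name `triangle_motifs` and the toy edge list are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def triangle_motifs(edges):
    """Enumerate triangle motifs of an undirected graph (illustrative only).

    Returns a pair (closed, wedges):
      closed: set of sorted node triples whose three ties are all present
      wedges: set of (center, a, b) triples where center is tied to a and b,
              but a and b are not tied to each other
    """
    # Build an adjacency map, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    closed, wedges = set(), set()
    for center, nbrs in adj.items():
        # Every pair of neighbors of `center` forms one candidate motif.
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                closed.add(tuple(sorted((center, a, b))))
            else:
                wedges.add((center, a, b))
    return closed, wedges


# Toy network: one closed triangle (1, 2, 3) plus a pendant tie 3-4.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
closed, wedges = triangle_motifs(edges)
print("closed:", sorted(closed))
print("open:  ", sorted(wedges))
```

In this sketch the number of motifs centered on a node grows quadratically with its degree, so a system aiming at millions of nodes would presumably need to bound or subsample the motifs per node; how SLR itself does this is not stated in the text above.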
