Incorporating popularity in topic models for social network analysis

Topic models are used to group the words in a text dataset into a set of relevant topics. Unfortunately, when a few words appear very frequently in a dataset, the topic groups identified by topic models become noisy because these frequent words repeatedly appear in "irrelevant" topic groups. This noise has not been a serious problem in text datasets because the frequent words (e.g., "the" and "is") carry little meaning and are simply removed before topic model analysis. However, in the social network datasets we are interested in, the frequent words correspond to popular persons (e.g., Barack Obama and Justin Bieber) and cannot simply be removed because most people are interested in them. To solve this "popularity problem", we explicitly model the popularity of nodes (words) in topic models. For this purpose, we first introduce the notion of a "popularity component" and propose topic model extensions that effectively accommodate it. We evaluate the effectiveness of our models on a real-world Twitter dataset. Our proposed models achieve significantly lower perplexity (i.e., better predictive power) than the state-of-the-art baselines. In addition to the popularity problem caused by nodes with high incoming edge degree, we also investigate the effect of outgoing edge degree with another topic model extension. We show that considering outgoing edge degree does not help much in achieving lower perplexity.
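For readers unfamiliar with the evaluation metric, the following minimal sketch shows how perplexity is conventionally computed from the probabilities a model assigns to held-out observations (here, held-out follow edges). It uses the standard definition, the exponential of the negative mean log-likelihood, and is not taken from the paper's implementation; the function name `perplexity` and the example probabilities are illustrative assumptions.

```python
import math


def perplexity(probabilities):
    """Standard perplexity: exp of the negative mean log-likelihood.

    `probabilities` holds the probability a model assigned to each
    held-out observation (hypothetical values, not results from the paper).
    """
    if not probabilities:
        raise ValueError("need at least one held-out probability")
    log_likelihood = sum(math.log(p) for p in probabilities)
    return math.exp(-log_likelihood / len(probabilities))


# A model that spreads its belief uniformly over four candidates...
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# ...versus one that assigns higher probability to what was actually observed.
sharper = perplexity([0.5, 0.5, 0.25, 0.5])
```

A lower value means the model was less "surprised" by the held-out data, which is why lower perplexity is read as better predictive power.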
