On Dynamic Topic Models for Mining Social Media

Analyzing media in real time is of great importance, with social media platforms at the epicenter of crunching, digesting, and disseminating content to the individuals connected to them. Within this context, topic models, especially latent Dirichlet allocation (LDA), have gained strong momentum due to their scalability, inference power, and compact semantics. However, state-of-the-art topic models fall short in handling large chunks of streaming data arriving dynamically on the platform, which hinders both their quality of interpretation and their adaptability to information overload. In this manuscript (Jaradat et al. OLLDA: a supervised and dynamic topic mining framework in twitter. In: 2015 IEEE international conference on data mining workshop (ICDMW), November 2015. IEEE, Piscataway, pp. 1354–1359), we evaluate a labeled and online extension to LDA (OLLDA), which incorporates supervision through external labeling and can quickly digest real-time updates, making it more adaptive to Twitter and similar platforms. The proposed extension can handle large quantities of newly arrived documents in a stream while still achieving high topic inference quality on the short and often sloppy text of tweets. Our approach mainly uses an approximate inference technique based on variational inference, coupled with a labeled LDA (L-LDA) model. We conclude by presenting experiments on a 1-year crawl of Twitter data that show significantly improved topical inference as well as temporal user profile classification when compared to state-of-the-art baselines. Given the popularity of word prediction techniques such as Word2vec, we also present an additional benchmark to measure classification performance.
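
To make the combination of online variational inference and label supervision concrete, the following is a minimal sketch, not the OLLDA implementation itself. It assumes the standard online variational Bayes update for LDA (a decaying step size applied to mini-batch statistics) and the labeled LDA constraint that a document may only use topics matching its own labels. The class name `OnlineLabeledLDA`, the helper `digamma`, and all hyperparameter names and defaults (`alpha`, `eta`, `tau0`, `kappa`) are illustrative assumptions.

```python
import math


def digamma(x):
    """Approximate digamma via recurrence plus an asymptotic series (x > 0)."""
    result = 0.0
    while x < 6.0:  # shift x upward until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


class OnlineLabeledLDA:
    """Hypothetical sketch: one topic per label, updated one mini-batch at a time."""

    def __init__(self, num_labels, vocab_size, total_docs,
                 alpha=0.1, eta=0.01, tau0=1.0, kappa=0.7):
        self.K, self.V, self.D = num_labels, vocab_size, total_docs
        self.alpha, self.eta = alpha, eta
        self.tau0, self.kappa = tau0, kappa
        self.t = 0  # number of mini-batches seen so far
        # Constant initialisation is an assumption; the label constraint
        # is what breaks the symmetry between topics here.
        self.lam = [[1.0] * vocab_size for _ in range(num_labels)]

    def _elog_beta(self):
        # E[log beta_kw] under the variational Dirichlet over topic-word rows.
        out = []
        for row in self.lam:
            row_total = digamma(sum(row))
            out.append([digamma(v) - row_total for v in row])
        return out

    def _infer_doc(self, counts, labels, elog_beta, iters=20):
        # Per-document E-step restricted to the document's own labels
        # (each document is assumed to carry at least one label).
        gamma = {k: 1.0 for k in labels}
        phi = {}
        for _ in range(iters):
            gamma_total = digamma(sum(gamma.values()))
            elog_theta = {k: digamma(g) - gamma_total for k, g in gamma.items()}
            new_gamma = {k: self.alpha for k in gamma}
            for w, n in counts.items():
                logw = {k: elog_theta[k] + elog_beta[k][w] for k in gamma}
                top = max(logw.values())  # subtract max for numerical stability
                weights = {k: math.exp(v - top) for k, v in logw.items()}
                norm = sum(weights.values())
                phi[w] = {k: v / norm for k, v in weights.items()}
                for k, p in phi[w].items():
                    new_gamma[k] += n * p
            gamma = new_gamma
        return gamma, phi

    def update(self, batch):
        """batch: list of (word_id -> count dict, list of label ids)."""
        elog_beta = self._elog_beta()
        sstats = [[0.0] * self.V for _ in range(self.K)]
        for counts, labels in batch:
            _, phi = self._infer_doc(counts, labels, elog_beta)
            for w, n in counts.items():
                for k, p in phi[w].items():
                    sstats[k][w] += n * p
        rho = (self.tau0 + self.t) ** (-self.kappa)  # decaying step size
        scale = self.D / len(batch)  # rescale the batch to the full corpus
        for k in range(self.K):
            for w in range(self.V):
                target = self.eta + scale * sstats[k][w]
                self.lam[k][w] = (1.0 - rho) * self.lam[k][w] + rho * target
        self.t += 1

    def topic_word(self, k):
        """Normalised word distribution of topic (label) k."""
        total = sum(self.lam[k])
        return [v / total for v in self.lam[k]]
```

In use, each incoming batch of tweets would be converted to word-id counts with their label ids and passed to `update`, after which `topic_word(k)` gives the current word distribution for label `k`. Because each update touches only the current mini-batch, the cost per step does not grow with the number of documents already seen, which is the property that makes this family of updates suitable for streams.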
