Using Hashtag Graph-Based Topic Model to Connect Semantically-Related Words Without Co-Occurrence in Microblogs

In this paper, we introduce a new topic model that uses hashtag graphs to make sense of the chaotic microblogging environment. Inferring topics on Twitter is a vital but challenging task in many important applications. The shortness and informality of tweets lead to extremely sparse vector representations over a large vocabulary, which causes conventional topic models (e.g., latent Dirichlet allocation [1] and latent semantic analysis [2]) to fail to learn high-quality topic structures. Tweets, however, frequently come with rich user-generated hashtags. These hashtags make tweets semi-structured and semantically related to one another. Since hashtags are used as keywords to mark messages or to form conversations, they provide an additional path for connecting semantically related words. Treating tweets as semi-structured texts, we propose a novel topic model, the Hashtag Graph-based Topic Model (HGTM), to discover topics in tweets. By exploiting hashtag relation information in hashtag graphs, HGTM can discover semantic relations between words even when the words do not co-occur within the same tweet, which alleviates the sparsity problem. Our investigation shows that user-contributed hashtags can serve as weakly supervised information for topic modeling, and that relations between hashtags can reveal latent semantic relations between words. We evaluate the effectiveness of HGTM on tweet (hashtag) clustering and hashtag classification problems. Experiments on two real-world tweet data sets show that HGTM handles the sparseness and noise of tweets well. Furthermore, HGTM discovers more distinct and coherent topics than state-of-the-art baselines.
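The idea of linking words through hashtags rather than through direct co-occurrence can be illustrated with a small sketch. This is not the HGTM inference procedure, which the abstract does not specify; it only builds a hashtag co-occurrence graph and follows its edges. The toy tweets and the helper names (`hashtag_graph`, `words_by_hashtag`, `related_words`) are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each tweet is (words, hashtags).
tweets = [
    (["goal", "striker"], ["#football"]),
    (["match", "tonight"], ["#football", "#worldcup"]),
    (["brazil", "squad"], ["#worldcup"]),
]

def hashtag_graph(tweets):
    """Weighted graph: edge weight = number of tweets containing both hashtags."""
    graph = defaultdict(lambda: defaultdict(int))
    for _, tags in tweets:
        for a, b in combinations(sorted(set(tags)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def words_by_hashtag(tweets):
    """Map each hashtag to the set of words used in tweets carrying it."""
    index = defaultdict(set)
    for words, tags in tweets:
        for tag in tags:
            index[tag].update(words)
    return index

def related_words(word, tweets):
    """Words linked to `word` via its hashtags or their graph neighbours,
    excluding words that directly co-occur with it in some tweet."""
    graph = hashtag_graph(tweets)
    index = words_by_hashtag(tweets)
    own_tags = {t for ws, tags in tweets if word in ws for t in tags}
    reachable_tags = set(own_tags)
    for tag in own_tags:
        reachable_tags.update(graph[tag])  # one hop in the hashtag graph
    cooccurring = {w for ws, _ in tweets if word in ws for w in ws}
    linked = set()
    for tag in reachable_tags:
        linked |= index[tag]
    return linked - cooccurring

print(sorted(related_words("goal", tweets)))
```

In this toy corpus, "goal" and "brazil" never appear in the same tweet and do not even share a hashtag, yet they become connected because #football and #worldcup co-occur in the second tweet.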

[1] Jon Kleinberg, et al. Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter, 2011, WWW.

[2] Thomas L. Griffiths, et al. Learning author-topic models from text corpora, 2010, TOIS.

[3] John D. Lafferty, et al. Correlated Topic Models, 2005, NIPS.

[4] Hong Cheng, et al. The dual-sparse topic model: mining focused topics and focused terms in short text, 2014, WWW.

[5] Yalou Huang, et al. Hashtag Graph Based Topic Model for Tweet Mining, 2014, 2014 IEEE International Conference on Data Mining.

[6] Susan T. Dumais, et al. Partially labeled topic models for interpretable text mining, 2011, KDD.

[7] Xiaohui Yan, et al. A biterm topic model for short texts, 2013, WWW.

[8] Jun Zhang, et al. Dirichlet Process Mixture Model for Document Clustering with Feature Partition, 2013, IEEE Transactions on Knowledge and Data Engineering.

[9] Guan Huang, et al. Tag-Weighted Dirichlet Allocation, 2013, 2013 IEEE 13th International Conference on Data Mining.

[10] Flora S. Tsai. A tag-topic model for blog mining, 2011, Expert Syst. Appl.

[11] Yulan He, et al. Joint sentiment/topic model for sentiment analysis, 2009, CIKM.

[12] Scott Sanner, et al. Improving LDA topic models for microblogs via tweet pooling and automatic labeling, 2013, SIGIR.

[13] Padhraic Smyth, et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, 2006, NIPS.

[14] Hongfei Yan, et al. Comparing Twitter and Traditional Media Using Topic Models, 2011, ECIR.

[15] Bo Zhao, et al. Probabilistic topic models with biased propagation on heterogeneous information networks, 2011, KDD.

[16] Ramesh Nallapati, et al. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora, 2009, EMNLP.

[17] Chris Jermaine, et al. Topic Models For Feature Selection in Document Clustering, 2013, SDM.

[18] Thomas Hofmann, et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, 2007.

[19] Huan Liu, et al. Enhancing accessibility of microblogging messages using semantic knowledge, 2011, CIKM '11.

[20] Xiaohui Yan, et al. Clustering short text using Ncut-weighted non-negative matrix factorization, 2012, CIKM.

[21] Timothy Baldwin, et al. Automatic Evaluation of Topic Coherence, 2010, NAACL.

[22] Pengtao Xie, et al. Integrating Document Clustering and Topic Modeling, 2013, UAI.

[23] Rong Pan, et al. Tag-Weighted Topic Model for Mining Semi-Structured Documents, 2013, IJCAI.

[24] Jiawei Han, et al. Modeling hidden topics on document manifold, 2008, CIKM '08.

[25] David M. Blei, et al. Probabilistic topic models, 2012, Commun. ACM.

[26] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[27] Zhoujun Li, et al. Emerging topic detection for organizations from microblogs, 2013, SIGIR.

[28] Thomas Hofmann, et al. Probabilistic Latent Semantic Indexing, 1999, SIGIR Forum.

[29] Susan T. Dumais, et al. Characterizing Microblogs with Topic Models, 2010, ICWSM.

[30] Brian D. Davison, et al. Empirical study of topic modeling in Twitter, 2010, SOMA '10.

[31] Michael S. Bernstein, et al. Short and tweet: experiments on recommending content from information streams, 2010, CHI.

[32] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[33] Andrew McCallum, et al. Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression, 2008, UAI.

[34] Qi He, et al. TwitterRank: finding topic-sensitive influential twitterers, 2010, WSDM '10.

[35] Leysia Palen, et al. Microblogging during two natural hazards events: what twitter may contribute to situational awareness, 2010, CHI.

[36] Thomas L. Griffiths, et al. Integrating Topics and Syntax, 2004, NIPS.

[37] Xu Chen, et al. The contextual focused topic model, 2012, KDD.

[38] Wray L. Buntine, et al. Topic Model: Extracting Product Opinions from Tweets by Leveraging Hashtags and Sentiment Lexicon, 2014.

[39] Maosong Sun, et al. Tag-LDA for Scalable Real-time Tag Recommendation, 2009.

[40] Fernando Diaz, et al. Time is of the essence: improving recency ranking using Twitter data, 2010, WWW '10.

[41] Thomas L. Griffiths, et al. Probabilistic author-topic models for information discovery, 2004, KDD.

[42] Qi Gao, et al. TUMS: Twitter-Based User Modeling Service, 2011, ESWC Workshops.

[43] Rynson W. H. Lau, et al. Knowledge and Data Engineering for e-Learning Special Issue of IEEE Transactions on Knowledge and Data Engineering, 2008.