Topic Modeling of Short Texts: A Pseudo-Document View

Recent years have witnessed the unprecedented growth of online social media, which has made short texts the prevalent format of information on the Internet. Because short texts are inherently sparse, however, short text topic modeling remains a critical and closely watched challenge in both academia and industry. Considerable research effort has gone into building different types of probabilistic topic models for short texts, among which self-aggregation methods that use no auxiliary information are emerging as a way to provide informative cross-text word co-occurrences. However, models along this line are still rare, and the representative one, the Self-Aggregation Topic Model (SATM), is prone to overfitting and computationally expensive. In light of this, we propose a novel probabilistic model called the Pseudo-document-based Topic Model (PTM) for short text topic modeling. PTM introduces the concept of a pseudo document to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo documents rather than those of short texts, PTM is expected to perform well in both accuracy and efficiency. A Sparsity-enhanced PTM (SPTM for short) is also proposed by applying a Spike and Slab prior, with the purpose of eliminating undesired correlations between pseudo documents and latent topics. Extensive experiments on various real-world data sets with state-of-the-art baselines demonstrate the high quality of the topics learned by PTM and its robustness to reduced training samples. The experiments also show that i) SPTM gains a clear edge over PTM when the number of pseudo documents is relatively small, and ii) the constraint that a short text belongs to only one pseudo document is critically important for the success of PTM. Finally, an in-depth semantic analysis directly reveals the role of pseudo documents in finding cross-text word co-occurrences for topic modeling.
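
The pseudo-document idea can be illustrated with a minimal generative sketch in Python. It assumes one plausible reading of the abstract: each short text is assigned to exactly one latent pseudo document, and every word in the text draws its topic from that pseudo document's topic distribution. The function name `generate_corpus`, all parameter names, the symmetric Dirichlet priors, and the default values are assumptions made for illustration, not details taken from the paper, and the sketch covers only data generation, not inference.

```python
import numpy as np


def generate_corpus(n_texts=200, n_pseudo=10, n_topics=5, vocab_size=50,
                    text_len=8, alpha=0.5, beta=0.5, lam=1.0, seed=0):
    """Sample short texts from an assumed pseudo-document generative story."""
    rng = np.random.default_rng(seed)

    # Distribution over pseudo documents (assumed symmetric Dirichlet prior).
    psi = rng.dirichlet(np.full(n_pseudo, lam))
    # One topic distribution per pseudo document, not per short text.
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_pseudo)
    # One word distribution per topic.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    # Each short text belongs to exactly one pseudo document.
    assignments = rng.choice(n_pseudo, size=n_texts, p=psi)

    texts = []
    for pseudo_doc in assignments:
        # Every word's topic comes from the owning pseudo document.
        topics = rng.choice(n_topics, size=text_len, p=theta[pseudo_doc])
        words = [int(rng.choice(vocab_size, p=phi[k])) for k in topics]
        texts.append(words)
    return texts, assignments, theta, phi


if __name__ == "__main__":
    texts, assignments, theta, phi = generate_corpus()
    # Words pooled per pseudo document: far more than any single short text.
    pooled = np.bincount(assignments, minlength=theta.shape[0]) * len(texts[0])
    print("words per short text:", len(texts[0]))
    print("words pooled per pseudo document:", pooled.tolist())
```

With far fewer pseudo documents than short texts, each topic distribution is estimated from the pooled words of many texts, which is the intuition behind aggregating against sparsity.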
