Short Text Topic Modeling Techniques, Applications, and Performance: A Survey

Inferring discriminative and coherent latent topics from short texts is a critical and fundamental task, since many real-world applications require semantic understanding of short texts. Traditional long text topic modeling algorithms (e.g., PLSA and LDA), which rely on word co-occurrences, cannot solve this problem well because only very limited word co-occurrence information is available in short texts. Short text topic modeling, which aims to overcome this sparseness problem, has therefore attracted much attention from the machine learning research community in recent years. In this survey, we conduct a comprehensive review of the short text topic modeling techniques proposed in the literature. We present three categories of methods, based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and an analysis of their performance on various tasks. We develop the first comprehensive open-source Java library, called STTM, which integrates all surveyed algorithms within a unified interface, together with benchmark datasets, to facilitate the development of new methods in this research field. Finally, we evaluate these state-of-the-art methods on many real-world datasets and compare their performance against one another and against a long text topic modeling algorithm.
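The sparseness problem can be made concrete with a small sketch. The Python snippet below is an illustration only, not part of STTM: the helper `cooccurrence_pairs` and both example documents are hypothetical. It counts the unordered pairs of distinct words that co-occur within a single document, which is the kind of evidence co-occurrence-based models such as PLSA and LDA rely on.

```python
from itertools import combinations


def cooccurrence_pairs(tokens):
    """Return the set of unordered pairs of distinct words in one document.

    Hypothetical helper for illustration; not an STTM API.
    """
    vocab = sorted(set(tokens))
    return set(combinations(vocab, 2))


# Toy documents (made up for this example): a tweet-length text and a longer one.
short_doc = "apple unveils new phone".split()
long_doc = (
    "apple unveils new phone with faster chip and better camera "
    "while analysts expect strong sales this quarter"
).split()

short_pairs = cooccurrence_pairs(short_doc)
long_pairs = cooccurrence_pairs(long_doc)

print(len(short_pairs), len(long_pairs))
```

A document with n distinct words yields n(n-1)/2 such pairs, so the amount of co-occurrence evidence grows quadratically with document length and is very small for tweet-length texts.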
