Paper Information - Short Text Topic Modeling Techniques, Applications, and Performance: A Survey

Inferring discriminative and coherent latent topics from short texts is a critical and fundamental task, since many real-world applications require semantic understanding of short texts. Traditional long text topic modeling algorithms (e.g., PLSA and LDA), which rely on word co-occurrences, cannot solve this problem well because only very limited word co-occurrence information is available in short texts. Short text topic modeling, which aims to overcome this sparseness problem, has therefore attracted much attention from the machine learning research community in recent years. In this survey, we conduct a comprehensive review of the short text topic modeling techniques proposed in the literature. We present three categories of methods, based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and an analysis of their performance on various tasks. We develop the first comprehensive open-source library, called STTM, for use in Java, which integrates all surveyed algorithms within a unified interface together with benchmark datasets, to facilitate the development of new methods in this research field. Finally, we evaluate these state-of-the-art methods on many real-world datasets and compare their performance against one another and against long text topic modeling algorithms.
