Micro-blog topic detection method based on BTM topic model and K-means clustering algorithm

The growth of micro-blogging, which generates short texts on a large scale, gives people a convenient means of communication. At the same time, discovering topics from such short texts has become a difficult problem. Traditional topic models such as probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA) model short texts poorly, because they suffer from severe data sparsity when applied to them. The K-means clustering algorithm, for its part, can make topics discriminative when the dataset is dense and the differences among topic documents are distinct. In this paper, the Biterm Topic Model (BTM) is employed to process short micro-blog texts and alleviate the sparsity problem. We then integrate the K-means clustering algorithm into BTM for further topic discovery. Experimental results on Sina micro-blog short text collections demonstrate that our method can discover topics effectively.
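The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a biterm is an unordered pair of words co-occurring in one short text, and that each micro-blog post has already been mapped to a vector of topic proportions by a trained BTM (the inference step itself is omitted). The function names `extract_biterms` and `kmeans` and the toy vectors are hypothetical.

```python
from itertools import combinations
import random


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-means over document vectors (assumed: BTM topic proportions)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct documents.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each document to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy document-topic vectors (made up for illustration, not real BTM output).
doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centroids = kmeans(doc_topic, k=2)
biterms = extract_biterms(["apple", "phone", "release"])
```

Because biterms are counted over the whole corpus rather than per document, word co-occurrence statistics stay usable even when each post contains only a few words. Clustering the resulting topic vectors is one plausible way to combine the two methods; the paper may integrate them differently.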
