A Novel Hybrid Clustering Algorithm for Topic Detection on Chinese Microblogging

The hot topics discussed on microblogs mirror public opinion, so topic detection on microblogs is of great significance for the detection and management of public opinion. However, traditional clustering algorithms struggle to handle large-scale microblogging data, which covers many topics and contains a high level of noise. We therefore propose a three-layer hybrid algorithm to tackle this problem. In the first layer, we use the $K$-means algorithm with optimized initial center selection to group the microblog texts efficiently. We then subdivide big clusters and isolate noisy texts to obtain purer clusters. In the second layer, we adopt the agglomerative nesting (AGNES) algorithm to merge the small clusters that refer to the same topic. We then exclude most of the noise, reducing its impact on the $K$-means step in the third layer, which corrects the erroneous merges made by AGNES. Experiments show that our algorithm outperforms some related traditional algorithms when clustering a real microblogging data set and performs well at topic detection.
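The three layers described above can be illustrated with a minimal sketch. This is not the authors' implementation: all function names, the cosine similarity measure, the farthest-first seeding (a stand-in for the optimized initial center selection), the merge threshold, and the minimum cluster size are assumptions made for illustration. The subdivision of big clusters is omitted, and small dense vectors stand in for microblog text features.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def farthest_first(points, k):
    # Assumed seeding heuristic: repeatedly pick the point least similar
    # to the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(min(points, key=lambda p: max(cosine(p, c) for c in centers)))
    return centers

def kmeans(points, centers, iters=20):
    # Plain K-means using cosine similarity, started from the given centers.
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(iters):
        new = [max(range(len(centers)), key=lambda j: cosine(p, centers[j]))
               for p in points]
        if new == labels:
            break
        labels = new
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = centroid(members)
    return labels

def agnes_merge(points, clusters, threshold):
    # Bottom-up merging: join the two most similar clusters (by centroid)
    # until no pair reaches the similarity threshold.
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        cents = [centroid([points[i] for i in c]) for c in clusters]
        sim, a, b = max((cosine(cents[a], cents[b]), a, b)
                        for a in range(len(clusters))
                        for b in range(a + 1, len(clusters)))
        if sim < threshold:
            break
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters

def hybrid_cluster(points, k_initial, merge_threshold=0.8, min_size=2):
    # Layer 1: over-cluster with K-means from spread-out seeds.
    labels = kmeans(points, farthest_first(points, k_initial))
    clusters = [[i for i, l in enumerate(labels) if l == j] for j in range(k_initial)]
    clusters = [c for c in clusters if c]
    # Layer 2: merge small clusters that appear to share a topic.
    merged = agnes_merge(points, clusters, merge_threshold)
    # Treat clusters that are still tiny after merging as noise (assumption).
    kept = [c for c in merged if len(c) >= min_size]
    keep_idx = sorted(i for c in kept for i in c)
    # Layer 3: K-means on the remaining points, seeded by the merged centroids.
    seeds = [centroid([points[i] for i in c]) for c in kept]
    final = kmeans([points[i] for i in keep_idx], seeds)
    result = [-1] * len(points)  # -1 marks points excluded as noise
    for i, l in zip(keep_idx, final):
        result[i] = l
    return result

# Toy data: three "topics" along different axes plus one off-topic vector.
points = [
    (1.0, 0.1, 0.0), (0.9, 0.0, 0.1), (1.0, 0.05, 0.05),
    (0.0, 1.0, 0.1), (0.1, 0.9, 0.0), (0.05, 1.0, 0.0),
    (0.0, 0.1, 1.0), (0.1, 0.0, 0.9), (0.0, 0.05, 1.0),
    (0.6, 0.6, 0.6),
]
labels = hybrid_cluster(points, k_initial=5)
print(labels)
```

Over-clustering first and merging afterwards means the number of topics does not have to be known in advance; in this sketch it emerges from the merge threshold, which would need tuning on real data.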
