Paper Information - Chinese Web Short Text Subject Clustering Based on Similarity Upper Approximation

In this paper, we propose a Web short text clustering method based on a modified Similarity Upper Approximation algorithm. After the initial text modeling, we reduce the dimension of the text feature-word matrix by singular value decomposition. After clustering is complete, we extract the most frequent words in each text cluster to represent the subject of that cluster. The clustering process does not require the number of clusters to be specified in advance, which makes it suitable for Web short text collections that are constantly updated and whose number of clusters cannot be known beforehand. To make the number of clusters more accurate, we extend the original algorithm with outlier detection and with cluster merging based on the average similarity of clusters. Experiments show that the modified algorithm is superior to the K-means algorithm and the hierarchical clustering algorithm in clustering accuracy, and that it determines the number of clusters more accurately than the original algorithm.
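The clustering idea summarized in the abstract can be illustrated with a small sketch. It is not the authors' modified algorithm: it omits the singular value decomposition step, the cluster merging, and the outlier detection, and it assumes one common reading of similarity upper approximation, in which each text's similarity class (all texts whose cosine similarity meets a threshold) is expanded repeatedly until it stops growing. The function names, the threshold value, and the pre-tokenized English sample texts are all hypothetical; real Chinese short texts would first need word segmentation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def upper_approximation_clusters(docs, threshold):
    """Group documents by iterating similarity upper approximations.

    Assumed reading of the technique: start from each document's
    similarity class and keep taking the union of the classes of its
    members until the set no longer changes.
    """
    vectors = [Counter(tokens) for tokens in docs]
    n = len(vectors)
    # Similarity class of each document (always contains the document itself).
    classes = [
        {i} | {j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold}
        for i in range(n)
    ]
    clusters = []
    for i in range(n):
        current = set(classes[i])
        while True:
            expanded = set().union(*(classes[j] for j in current))
            if expanded == current:
                break
            current = expanded
        if current not in clusters:  # keep each distinct cluster once
            clusters.append(current)
    return clusters


def top_words(cluster, docs, k=1):
    """Most frequent words in a cluster, used as its subject label."""
    counts = Counter(token for i in cluster for token in docs[i])
    return [word for word, _ in counts.most_common(k)]


# Hypothetical, already tokenized short texts.
docs = [
    ["phone", "battery", "phone"],
    ["phone", "screen"],
    ["football", "match", "football"],
    ["football", "goal"],
]
clusters = upper_approximation_clusters(docs, threshold=0.5)
subjects = [top_words(c, docs) for c in clusters]
```

Note that the number of clusters is never passed in; it falls out of the threshold, which is the property the abstract highlights for continuously updated Web short texts.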

YunHua Zhang | JiaWei Zhu

[1] Joshua Zhexue Huang, et al. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, 1998, Data Mining and Knowledge Discovery.

[2] Ari Rappoport, et al. Efficient Clustering of Short Messages into General Domains, 2013, ICWSM.

[3] J. MacQueen. Some methods for classification and analysis of multivariate observations, 1967.

[4] Tian Zhang, et al. BIRCH: an efficient data clustering method for very large databases, 1996, SIGMOD '96.

[5] Huiying Wang, et al. Study on frequent term set-based hierarchical clustering algorithm, 2011, 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD).

[6] Pushpak Bhattacharyya, et al. TwiSent: a multistage system for analyzing sentiment in twitter, 2012, CIKM '12.

[7] Zhang Chun-ping, et al. Research on K-means Clustering Algorithm, 2011.

[8] Xijin Tang, et al. Text clustering using frequent itemsets, 2010, Knowl. Based Syst.

[9] Anatole Gershman, et al. Topical Clustering of Tweets, 2011.

[10] Pradeep Kumar, et al. Clustering using Similarity Upper Approximation, 2006, 2006 IEEE International Conference on Fuzzy Systems.