Chinese Web Short Text Subject Clustering Based on Similarity Upper Approximation

In this paper, we propose a Web short text clustering method based on a modified Similarity Upper Approximation algorithm. After initial text modeling, we reduce the dimensionality of the text-feature word matrix by singular value decomposition. Once clustering is complete, we extract the most frequent words in each text cluster to represent that cluster's subject. The clustering process does not require the number of clusters to be specified in advance, which makes it suitable for Web short text that is constantly updated and whose number of clusters cannot be known beforehand. To make the resulting number of clusters more accurate, we extend the original algorithm with outlier detection and with cluster merging based on the average similarity of clusters. Experiments show that the modified algorithm achieves higher clustering accuracy than the K-means and hierarchical clustering algorithms and produces a more accurate number of clusters than the original algorithm.
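The pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small term-by-document count matrix of already segmented words (English toy words stand in for Chinese tokens), cosine similarity, and a simplified reading of similarity upper approximation in which each document's similarity class is repeatedly expanded until it stops changing. The function names, the threshold of 0.5, and the toy data are all hypothetical, and the cluster-merging and outlier-detection steps are omitted.

```python
import numpy as np

def reduce_with_svd(term_doc, k):
    """Project documents (columns of term_doc) onto the top-k singular directions."""
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k]).T  # one row per document

def cosine_similarity_matrix(docs):
    """Pairwise cosine similarity between the rows of docs."""
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = docs / norms
    return unit @ unit.T

def upper_approximation_clusters(sim, threshold):
    """Expand similarity classes until they are stable (assumed simplification)."""
    n = sim.shape[0]
    # Similarity class of document i: all documents at least `threshold` similar to it.
    classes = [frozenset(np.flatnonzero(sim[i] >= threshold).tolist()) | {i}
               for i in range(n)]
    while True:
        # Upper approximation step: union of the classes of every current member.
        expanded = [frozenset().union(*(classes[j] for j in c)) for c in classes]
        if expanded == classes:
            break
        classes = expanded
    # Distinct stable sets are the clusters; their number is not fixed in advance.
    return sorted({tuple(sorted(c)) for c in classes})

def top_words(term_doc, vocab, cluster, n_words=2):
    """Most frequent words across the documents of one cluster."""
    counts = term_doc[:, list(cluster)].sum(axis=1)
    order = np.argsort(-counts, kind="stable")[:n_words]
    return [vocab[i] for i in order]

# Toy data (hypothetical): rows are words, columns are four short texts.
vocab = ["phone", "battery", "screen", "movie", "actor", "plot"]
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])

reduced = reduce_with_svd(term_doc, k=2)
sim = cosine_similarity_matrix(reduced)
clusters = upper_approximation_clusters(sim, threshold=0.5)
subjects = [top_words(term_doc, vocab, c) for c in clusters]
print(clusters)  # [(0, 1), (2, 3)]
print(subjects)  # [['phone', 'battery'], ['movie', 'actor']]
```

In this sketch the number of clusters falls out of the similarity threshold rather than being passed in, which mirrors the property the method relies on. Iterating the expansion to a fixed point effectively groups documents linked by chains of similar neighbors, so on real data the threshold would need tuning.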