Paper Information - An Efficient Clustering Algorithm for Small Text Documents

An Efficient Clustering Algorithm for Small Text Documents

Clustering text documents into category groups is an important problem, and the size of the desired clusters is an important requirement for a clustering solution. In this paper, we present RTC, an efficient clustering algorithm for small text documents based on the spherical k-means algorithm. RTC introduces a new method for choosing initial centers that combines density and farthest-distance strategies. Building on the first-variation adjustment of the Ping-Pong algorithm, we also present a new partition adjustment method that is guided by the set of border objects of the clusters. We test the algorithm's performance on a Chinese natural language platform. The experimental results show that RTC outperforms spherical k-means and bisecting k-means in clustering accuracy, and outperforms Ping-Pong in both clustering accuracy and clustering time. In clustering time in particular, RTC is sometimes 5 times faster than Ping-Pong.
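For orientation, the following is a minimal sketch of plain spherical k-means, the baseline that RTC builds on, seeded with a simple farthest-first rule. It is not the RTC algorithm: the abstract does not specify the density-based part of the seeding or the border-object partition adjustment, so neither is attempted here. The function names (`normalize`, `cosine`, `farthest_first_centers`, `spherical_kmeans`) and the choice of the first document as the starting seed are assumptions made for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(u, v):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(a * b for a, b in zip(u, v))

def farthest_first_centers(docs, k):
    """Assumed seeding rule: start from docs[0], then repeatedly add the
    document whose highest similarity to the chosen seeds is lowest."""
    centers = [docs[0]]
    while len(centers) < k:
        farthest = min(docs, key=lambda d: max(cosine(d, c) for c in centers))
        centers.append(farthest)
    return centers

def spherical_kmeans(vectors, k, max_iter=100):
    """Cluster vectors by cosine similarity; returns (labels, centers)."""
    docs = [normalize(v) for v in vectors]
    centers = farthest_first_centers(docs, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its most similar center.
        new_labels = [max(range(k), key=lambda j: cosine(d, centers[j]))
                      for d in docs]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not change
        labels = new_labels
        # Update step: each center becomes the normalized sum of its members.
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:
                centers[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centers

# Toy term-weight vectors: two documents near each axis.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centers = spherical_kmeans(toy, k=2)
print(labels)
```

On real text the input vectors would typically be sparse TF-IDF weights; dense lists are used here only to keep the sketch self-contained.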

Jian Yin | Yubao Liu | Jiarong Cai | Zhilan Huang

[1] Shokri Z. Selim, et al. K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality, 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Inderjit S. Dhillon, et al. Refining clusters in high dimensional text data, 2003.

[3] Edie M. Rasmussen, et al. Clustering Algorithms, 1992, Information Retrieval: Data Structures & Algorithms.

[4] R. Mooney, et al. Impact of Similarity Measures on Web-page Clustering, 2000.

[5] George Karypis, et al. A Comparison of Document Clustering Techniques, 2000.

[6] Hans-Peter Kriegel, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996, KDD.

[7] David B. Shmoys, et al. A Best Possible Heuristic for the k-Center Problem, 1985, Math. Oper. Res.

[8] Inderjit S. Dhillon, et al. Concept Decompositions for Large Sparse Text Data Using Clustering, 2004, Machine Learning.

[9] Yoshua Bengio, et al. Convergence Properties of the K-Means Algorithms, 1994, NIPS.

[10] E. Forgy, et al. Cluster analysis of multivariate data: efficiency versus interpretability of classifications, 1965.