Paper Information - Chinese Text Clustering Algorithm Based k-means

Chinese Text Clustering Algorithm Based k-means

Abstract Text clustering is an important technique in text mining. This paper examines the process of Chinese text clustering based on k-means. Experiments showed that the new center of a cluster is easily affected by isolated texts. To reduce this effect, the average similarity of a cluster is used as a parameter and multiplied by a coefficient between 0.75 and 1.25 to obtain a similarity threshold. The texts whose similarity to the original cluster center is greater than or equal to the threshold are collected as a candidate set, and the cluster center is then updated with the center of that candidate set. The experiments show that the improved method increases purity and F value by about 10 percent on average over the original method.
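
The center update described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The abstract does not name the similarity measure, so cosine similarity over term-weight vectors is an assumption here. The names `cosine`, `mean_vector`, `update_center`, and `coeff` are hypothetical, and keeping the old center when no text passes the threshold is also an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def update_center(members, center, coeff=1.0):
    """Recompute a cluster center from texts close to the old center.

    members: text vectors currently assigned to the cluster
    center:  the cluster's current (original) center
    coeff:   multiplier for the average similarity; the abstract
             gives a range of 0.75 to 1.25
    """
    if not members:
        return list(center)
    sims = [cosine(m, center) for m in members]
    # Threshold = coefficient * average similarity within the cluster.
    threshold = coeff * (sum(sims) / len(sims))
    # Candidate set: texts at least as similar as the threshold.
    candidates = [m for m, s in zip(members, sims) if s >= threshold]
    if not candidates:
        # Assumption: keep the old center if no text qualifies.
        return list(center)
    return mean_vector(candidates)


# Two similar texts and one isolated text in the same cluster.
members = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
new_center = update_center(members, [1.0, 0.0], coeff=1.0)
```

In this toy input the isolated vector `[0.0, 1.0]` has similarity 0 to the old center and falls below the threshold, so it does not pull the new center toward itself the way a plain mean over all members would.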

Dechang Pi | Mingyu Yao | Xiangxiang Cong
