Chinese Text Clustering Algorithm Based on k-means

Abstract Text clustering is an important method in text mining. This paper examines the process of Chinese text clustering based on k-means. Experiments showed that the new center of a cluster is easily affected by isolated texts. To reduce this effect, the average similarity of a cluster is multiplied by a coefficient between 0.75 and 1.25 to obtain a similarity threshold. The texts whose similarity to the original cluster center is greater than or equal to this threshold are collected as a candidate set, and the cluster center is then updated with the center of the candidate set. Experiments show that the improved method increases purity and F value by about 10 percent on average over the original method.
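The center update described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: texts are assumed to be already represented as numeric feature vectors, similarity is assumed to be cosine similarity, and the "average similarity of a cluster" is taken to be the mean similarity of its texts to the original center. The names `cosine`, `mean_vector`, `update_center`, and `coefficient` are hypothetical, and keeping the old center when no text reaches the threshold is an added safeguard that the abstract does not specify.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def update_center(members, old_center, coefficient=1.0):
    """Recompute a cluster center from the texts close enough to the old center.

    members     -- feature vectors of the texts currently assigned to the cluster
    old_center  -- the cluster center before the update
    coefficient -- multiplier for the average similarity; the abstract gives
                   the range 0.75 to 1.25 (the default 1.0 is an assumption)
    """
    if not members:
        return list(old_center)

    # Similarity of every member to the original center (cosine is an assumption).
    sims = [cosine(v, old_center) for v in members]

    # Threshold = coefficient * average similarity of the cluster.
    threshold = coefficient * (sum(sims) / len(sims))

    # Candidate set: members whose similarity is >= the threshold.
    candidates = [v for v, s in zip(members, sims) if s >= threshold]

    # Assumption: if no member qualifies, keep the old center unchanged.
    if not candidates:
        return list(old_center)

    # The new center is the center of the candidate set only.
    return mean_vector(candidates)


if __name__ == "__main__":
    cluster = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # the third text is isolated
    print(update_center(cluster, old_center=[1.0, 0.0], coefficient=1.0))
```

In this toy cluster the third vector has zero similarity to the original center, so it falls below the threshold and does not pull the new center toward itself, whereas a plain k-means mean over all three vectors would. A coefficient above 1 tightens the candidate set and a coefficient below 1 loosens it.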