An Improved Similarity Measure for Chinese Text Clustering

Measuring the similarity between documents is a pivotal step in text processing. Traditional similarity measures consider only one aspect of the text features. The similarity measure proposed in this paper takes into account both the statistical information and the part of speech of feature terms. The relative weights of the statistical and semantic information, and the importance of each part of speech, are determined experimentally. The K-means algorithm and its variants are widely used for text clustering, especially on large datasets. The choice of initial cluster centers is important because it affects both the number of iterations and the cluster quality. Building on previous research, we propose a new method that selects initial cluster centers by combining maximum distance and statistical features. Experiments show that the improved method improves cluster quality in terms of F-measure and reduces time consumption.
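As a rough illustration of the maximum-distance idea behind initial center selection, the sketch below picks k seeds by farthest-first traversal over document vectors. It is not the method of this paper: the function name max_distance_centers, the use of Euclidean distance, and the rule of starting from the point farthest from the mean are all assumptions made for the example, and the statistical-feature component described above is omitted.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_distance_centers(vectors, k):
    """Return indices of k initial centers chosen by farthest-first traversal.

    Hypothetical helper for illustration only; the paper's actual selection
    rule also uses statistical features, which are not modeled here.
    """
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")

    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    # Assumption: start from the point farthest from the overall mean.
    first = max(range(len(vectors)), key=lambda i: euclidean(vectors[i], mean))
    chosen = [first]

    # min_dist[i] is the distance from point i to its nearest chosen center.
    min_dist = [euclidean(v, vectors[first]) for v in vectors]

    while len(chosen) < k:
        # The next center is the point farthest from all chosen centers.
        nxt = max(range(len(vectors)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], euclidean(v, vectors[nxt]))

    return chosen


if __name__ == "__main__":
    docs = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0]]
    print(max_distance_centers(docs, 3))
```

Seeds chosen this way are spread far apart, which tends to reduce the number of K-means iterations, but the approach is sensitive to outliers. That sensitivity is one reason a selection rule might combine distance with statistical features.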