Vector Space Model-Based Document Clustering Research

Document clustering plays an important role in web text mining and is widely applied in text processing. This paper first introduces the Vector Space Model, which represents documents as vectors (or points) in a multidimensional space. To improve the accuracy of similarity measurement between documents, it defines a more reasonable way to evaluate the weight of the terms contained in a document. It then analyzes in detail the partition-based K-means algorithm, which is widely used in document clustering. Because K-means selects its initial starting points randomly, which is a known deficiency, the paper adopts an iterative max-min distance method combined with sampling techniques to optimize the selection of the initial cluster centers, which helps improve the final clustering result.
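
To make the pipeline concrete, the following is a minimal Python sketch of the three stages named above: term weighting in the Vector Space Model, max-min distance selection of initial centers on a sample, and K-means refinement. It rests on several assumptions that are not taken from the paper: standard TF-IDF stands in for the paper's own weighting scheme, Euclidean distance on length-normalized vectors stands in for its similarity measure, a single random sample and a single max-min pass stand in for its iterative procedure, and all function names and parameters (tfidf_vectors, max_min_centers, kmeans, sample_size, seed) are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to L2-normalized TF-IDF vectors (term -> weight).

    Assumption: plain TF-IDF, not the paper's own weighting formula.
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def max_min_centers(vectors, k, sample_size, seed=0):
    """Pick k initial centers from a random sample by the max-min rule."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    centers = [sample[0]]
    while len(centers) < k:
        # Next center: the sampled point whose nearest chosen center is farthest.
        farthest = max(sample, key=lambda v: min(distance(v, c) for c in centers))
        centers.append(farthest)
    return centers


def kmeans(vectors, centers, iterations=20):
    """Standard K-means: assign to nearest center, then recompute the means."""
    centers = list(centers)  # do not mutate the caller's list
    labels = []
    for _ in range(iterations):
        new_labels = [
            min(range(len(centers)), key=lambda j: distance(v, centers[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)  # adds the weights term by term
                centers[j] = {t: w / len(members) for t, w in total.items()}
    return labels


# Toy corpus of already tokenized documents (illustrative data only).
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "cat", "cat"],
    ["stock", "market", "trade"],
    ["market", "trade", "stock", "price"],
]
vectors = tfidf_vectors(docs)
initial = max_min_centers(vectors, k=2, sample_size=4)
labels = kmeans(vectors, initial)
```

In this sketch the max-min rule spreads the initial centers apart, so the two starting points are unlikely to fall inside the same natural cluster, which is the failure mode of purely random initialization. Running the selection on a sample rather than on the full collection keeps its cost proportional to the sample size times k.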