A Study on the Optimal Number of Clusters for Semantic-Based Chinese Text Clustering

The effect of the number of clusters on the clustering of large data samples is analyzed, and several prevailing indices for measuring clustering quality are reviewed. The optimal number of classes for text semantics is studied using the concept of text similarity, and an algorithm, CTBP, for determining the optimal number of clusters during the clustering process is presented. Its main idea is to extract a word from each text vector and arrange the results into an array ordered by text similarity; the number of classes in the optimal partition is then obtained from the increments produced by dividing this array layer by layer. The required statistical information is gathered in a single scan of the data, from which the optimal solution is obtained. Experimental results show that the method helps improve both the speed and the quality of clustering.
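
The general idea of reading a cluster count off an ordered similarity array can be illustrated with a minimal sketch. This is not the CTBP algorithm itself, whose details are not given here. It assumes that each text has already been segmented into a list of words, that similarity is measured with the Jaccard coefficient against the first text as a reference, and that a jump larger than a chosen threshold between neighbouring sorted similarities marks a boundary between clusters. The names `jaccard`, `estimate_cluster_count`, and `min_gap` are hypothetical and introduced only for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def estimate_cluster_count(texts, min_gap=0.3):
    """Estimate a cluster count from gaps in an ordered similarity array.

    Assumptions (not taken from the paper): every text is compared with
    the first text as a reference, and an increment larger than `min_gap`
    between adjacent sorted similarities is treated as a cluster boundary.
    """
    if not texts:
        return 0
    reference = texts[0]
    # Single pass over the data to collect the similarity values,
    # followed by a sort to obtain the ordered array.
    sims = sorted(jaccard(reference, t) for t in texts)
    # Increments between neighbouring entries of the ordered array.
    gaps = [hi - lo for lo, hi in zip(sims, sims[1:])]
    # Each large increment separates two groups of texts.
    return 1 + sum(1 for g in gaps if g > min_gap)


# Hypothetical pre-segmented texts: three about pets, two about finance.
texts = [
    ["cat", "dog", "pet", "food"],
    ["cat", "dog", "pet", "toy"],
    ["cat", "dog", "pet", "food", "toy"],
    ["stock", "market", "price"],
    ["stock", "price", "trade"],
]
count = estimate_cluster_count(texts)
```

Projecting every text onto its similarity with one reference is a deliberate simplification: it keeps the sketch to a single scan plus a sort, but texts that are unrelated to the reference and to each other collapse to the same value, so the estimate depends on the choice of reference and of `min_gap`.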