Paper Information - An Improved KNN Text Classification Algorithm Based on Clustering

The traditional KNN text classification algorithm uses all training samples for classification, so it must process a huge number of training samples at high computational complexity, and it does not reflect the differing importance of individual samples. To address these problems, this paper proposes an improved KNN text classification algorithm based on cluster centers. First, the given training sets are compressed and the samples near the class border are deleted, which eliminates the multipeak effect of the training sample sets. Second, the training samples of each category are clustered with the k-means algorithm, and all cluster centers are taken as the new training samples. Third, a weight is introduced for each new training sample, indicating its importance according to the number of samples in the cluster that its center represents. Finally, the modified samples are used to perform KNN text classification. Simulation results show that the proposed algorithm not only effectively reduces the actual number of training samples and lowers the computational complexity, but also improves the accuracy of KNN text classification.
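The clustering, weighting, and classification steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names (`kmeans`, `build_prototypes`, `classify`), the use of Euclidean distance, the choice of raw cluster size as the weight, and the weighted-vote rule are all assumptions made for illustration. The border-sample compression step is omitted because the abstract does not specify how it is done.

```python
import math
import random
from collections import defaultdict


def assign(points, centers):
    """Group each point with its nearest center (Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        clusters[j].append(p)
    return clusters


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centers, cluster_sizes)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = assign(points, centers)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    sizes = [len(cl) for cl in assign(points, centers)]
    return centers, sizes


def build_prototypes(train, k_per_class, seed=0):
    """Cluster each class separately and keep only the weighted cluster centers.

    `train` is a list of (vector, label) pairs. The weight of a center is
    assumed here to be the number of samples in its cluster.
    """
    by_label = defaultdict(list)
    for vec, label in train:
        by_label[label].append(tuple(vec))
    prototypes = []
    for label, points in by_label.items():
        k = min(k_per_class, len(points))
        centers, sizes = kmeans(points, k, seed=seed)
        for center, size in zip(centers, sizes):
            if size > 0:  # drop centers that ended up representing no samples
                prototypes.append((center, label, size))
    return prototypes


def classify(x, prototypes, k=3):
    """Weighted KNN over the prototypes: each of the k nearest centers
    votes for its label with its weight (an assumed voting rule)."""
    nearest = sorted(prototypes, key=lambda pr: math.dist(x, pr[0]))[:k]
    votes = defaultdict(float)
    for _, label, weight in nearest:
        votes[label] += weight
    return max(votes, key=votes.get)
```

In this sketch, `build_prototypes` is called once on the training set and `classify` is then called per document vector, so each query compares against one vector per cluster instead of one per training sample. That is where the reduction in computation would come from.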
