Research on Text Categorization of KNN Based on K-Means for Class Imbalanced Problem

With the rapid growth of the Web and of text information, organizing and managing this information effectively has become a major challenge for information science. Automatic text classification can organize large numbers of texts and improve the efficiency of information retrieval, and it has become one of the most important research directions in information processing. Among the many mature text classification methods, the K-Nearest Neighbor (KNN) algorithm achieves good accuracy, suits multi-class problems, and has been widely used in document classification. However, when the training set is class-imbalanced, its classification results tend to be biased towards the majority class, which greatly reduces the accuracy of the classifier. To address this problem, this paper proposes two strategies to improve the traditional KNN algorithm: sample construction based on clustering, and weighted KNN based on sample density. Four datasets with different class imbalance rates are extracted from the entire corpus, and cross validation is performed on each dataset with the classic KNN, NWKNN, and Kmeans-KNN algorithms. The results show that, compared with the traditional KNN and NWKNN algorithms, the proposed method effectively improves classification accuracy and G-mean, and is more stable under class imbalance.
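Since the evaluation reports G-mean alongside accuracy, a small illustration of why the two can diverge on imbalanced data may be useful. The sketch below assumes the common definition of G-mean as the geometric mean of per-class recall; the abstract does not state which variant is used, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import prod


def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted samples / samples of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition of G-mean)."""
    recalls = per_class_recall(y_true, y_pred)
    return prod(recalls.values()) ** (1.0 / len(recalls))


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 9 majority-class documents and 1 minority-class document.
y_true = ["maj"] * 9 + ["min"]

# A classifier biased towards the majority class labels everything "maj".
y_biased = ["maj"] * 10

# A classifier that finds the minority document at the cost of one majority error.
y_balanced = ["maj"] * 8 + ["min"] + ["min"]

for name, y_pred in [("biased", y_biased), ("balanced", y_balanced)]:
    print(name, accuracy(y_true, y_pred), round(g_mean(y_true, y_pred), 4))
```

Both toy classifiers reach the same accuracy, but the majority-biased one has a G-mean of zero because it never recovers the minority class. This is why a metric such as G-mean is informative when comparing KNN variants under class imbalance.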