Paper information - Research on Text Categorization of KNN Based on K-Means for Class Imbalanced Problem

With the rapid growth of the Web and of text information, organizing and managing this information effectively is a major challenge for information science. Automatic text classification can organize large numbers of texts and help people retrieve information more efficiently, and it has become one of the most important research directions in information processing. Among the many mature text classification methods, the K-Nearest Neighbor (KNN) algorithm achieves good accuracy, suits multi-class problems, and has been widely used in document classification. However, when the training set is class imbalanced, classification results tend to be biased towards the majority class, which greatly reduces the accuracy of the classifier. To address this problem, this paper proposes two strategies for improving the traditional KNN algorithm: constructing samples based on clustering, and weighting KNN based on sample density. Four datasets with different class imbalance rates are extracted from the entire corpus, and the classic KNN, NWKNN and Kmeans-KNN algorithms are each evaluated by cross validation on every dataset. The results show that, compared with the traditional KNN and NWKNN algorithms, the proposed method effectively improves classification accuracy and G-mean and is more stable under class imbalance.
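The abstract reports G-mean alongside accuracy. As a minimal sketch, assuming the common multi-class definition of G-mean as the geometric mean of per-class recall (the abstract does not state which definition the paper uses), the following shows why the two metrics can disagree on an imbalanced set. The function names `g_mean` and `accuracy` and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import prod


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition, see lead-in)."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    recalls = [hits[c] / support[c] for c in support]
    return prod(recalls) ** (1.0 / len(recalls))


# Toy imbalanced set: 8 majority-class samples ("a"), 2 minority-class samples ("b").
y_true = ["a"] * 8 + ["b"] * 2

# Classifier 1 always predicts the majority class.
all_majority = ["a"] * 10

# Classifier 2 mislabels two majority samples but recovers both minority samples.
balanced = ["a"] * 6 + ["b"] * 2 + ["b"] * 2

for name, y_pred in [("all_majority", all_majority), ("balanced", balanced)]:
    print(name, accuracy(y_true, y_pred), g_mean(y_true, y_pred))
```

Both toy classifiers reach the same accuracy, but the one that ignores the minority class has a recall of zero on that class and therefore a G-mean of zero. This is the kind of majority-class bias that G-mean is intended to expose.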

Wang Yu | Xu Linying

[1] Edward Y. Chang, et al. Class-Boundary Alignment for Imbalanced Dataset Learning, 2003.

[2] Hu Yunfa, et al. A Strategy to Class Imbalance Problem for kNN Text Classifier, 2009.

[3] Gustavo E. A. P. A. Batista, et al. A study of the behavior of several methods for balancing machine learning training data, 2004, SKDD.

[4] Zhang Xue-ren. An Improved Density-Based KNN Algorithm under Clustering, 2011.

[5] Herna L. Viktor, et al. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach, 2004, SKDD.

[6] Salvatore J. Stolfo, et al. AdaCost: Misclassification Cost-Sensitive Boosting, 1999, ICML.

[7] Su Zhan. An Improved KNN Text Categorization Method Based on Data Uneven, 2010.

[8] Songbo Tan, et al. Neighbor-weighted K-nearest neighbor for unbalanced text corpus, 2005, Expert Syst. Appl.

[9] Xu Xin, et al. Advances in Machine Learning Based Text Categorization, 2006.

[10] Zhang Tao, et al. Classification for Imbalanced Dataset of Improved Weighted KNN Algorithm, 2012.