An adaptive k-nearest neighbor text categorization strategy

The parameter k is the most important setting in a text categorization system based on the k-nearest neighbor (kNN) algorithm. To classify a new document, the k nearest documents in the training set are found first, and the document's categories are then predicted from the category distribution among those neighbors. In practice, the class distribution of a training set is rarely even: some classes have far more samples than others. The system's performance is therefore very sensitive to the choice of k, and a fixed k value is likely to bias predictions toward large categories and to underuse the information in the training set. To address these problems, this article proposes an improved kNN strategy that uses a different number of nearest neighbors for each category instead of a fixed number across all categories. More neighbors are consulted when deciding whether a test document belongs to a category that has more samples in the training set, so the neighborhood size selected for each category adapts to its sample size. Experiments on two different datasets show that the proposed method is less sensitive to the parameter k than traditional kNN and can correctly classify documents belonging to small classes even with a large k. The strategy is especially applicable and promising when estimating k via cross-validation is not possible and the class distribution of the training set is skewed.
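The core idea can be sketched in a few lines. The function below is a minimal illustration, not the paper's exact method: the scaling rule k_c = ceil(k · n_c / n_max) and the normalized per-category voting score are illustrative assumptions chosen to show how a category-dependent neighborhood size keeps small classes from being outvoted under a large fixed k.

```python
import math

def adaptive_knn_predict(neighbor_labels, class_counts, k):
    """Single-label kNN prediction with category-adaptive neighborhood sizes.

    neighbor_labels: labels of the k nearest training documents,
        ordered from nearest to farthest.
    class_counts: mapping {category: number of training samples}.
    k: the global neighborhood size.

    For each category c with n_c training samples, only its nearest
    k_c = ceil(k * n_c / n_max) neighbors may vote for it, and the vote
    is normalized by k_c so scores are comparable across categories.
    (This particular scaling rule is an assumption for illustration.)
    """
    n_max = max(class_counts.values())
    best_cat, best_score = None, -1.0
    for cat, n_c in class_counts.items():
        k_c = max(1, math.ceil(k * n_c / n_max))
        # Count votes for this category among its own k_c nearest neighbors.
        votes = sum(1 for lbl in neighbor_labels[:k_c] if lbl == cat)
        score = votes / k_c
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat
```

For example, with a 90/10 class imbalance and k = 10, a test document whose two nearest neighbors belong to the small class is assigned to it, even though eight of the ten nearest neighbors belong to the large class and a fixed-k majority vote would choose the large class.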
