Improving K-nearest neighbor efficiency for text categorization

With the increasing use of the Internet and electronic documents, automatic text categorization has become imperative. Many classification methods have been applied to text categorization. The k-nearest neighbors (k-NN) classifier is known to be one of the best state-of-the-art classifiers for this task. However, k-NN suffers from limitations such as high computational cost, low tolerance to noise, and dependence on the parameter k and the distance function. In this paper, we first survey improved algorithms proposed in the literature to address these shortcomings. Second, we discuss an approach to improve k-NN efficiency without degrading classification performance. Experimental results on the 20Newsgroup and Reuters corpora show that the proposed approach increases the performance of k-NN and reduces classification time.
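To make the limitations named in the abstract concrete, the following is a minimal sketch of a plain k-NN text classifier, not the approach proposed in the paper. It assumes raw term-frequency vectors, cosine similarity, and unweighted majority voting; the names `vectorize`, `cosine`, `knn_classify`, and the toy `TRAIN` set are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Label a query by majority vote among its k most similar documents.

    Every training document is compared with the query, so the cost of a
    single classification grows linearly with the size of the training set.
    """
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), label) for doc, label in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented for illustration only.
TRAIN = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("the coach praised the players", "sport"),
    ("the market closed higher today", "finance"),
    ("shares fell after the earnings report", "finance"),
    ("the bank raised interest rates", "finance"),
]

print(knn_classify("the players scored a goal in the match", TRAIN, k=3))
print(knn_classify("interest rates and the market", TRAIN, k=3))
```

The sketch exposes the three issues the abstract lists: each query scans the whole training set, a single mislabeled neighbor can change the vote when k is small, and the result depends on both the choice of k and the similarity function.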
