An improved K-nearest-neighbor algorithm for text categorization

Text categorization is an important tool for managing and organizing the rapidly growing volume of text data. Many text categorization algorithms have been explored in the literature, such as KNN, Naive Bayes and Support Vector Machine. KNN is an effective but computationally inefficient method for text categorization. In this paper, we propose an improved KNN algorithm for text categorization, which builds the classification model by combining a constrained one-pass clustering algorithm with KNN text categorization. Empirical results on three benchmark corpora show that our algorithm can substantially reduce the number of text similarity computations and outperform state-of-the-art KNN, Naive Bayes and Support Vector Machine classifiers. In addition, the classification model constructed by the proposed algorithm can be updated incrementally, which makes it scalable in many real-world applications.
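The general idea of combining one-pass clustering with KNN can be illustrated with a minimal sketch. It is not the paper's actual algorithm: the abstract does not define the clustering constraint, the similarity measure, or the voting rule, so the sketch assumes that each cluster holds documents of a single class, that similarity is cosine similarity against the cluster's summed vector, and that votes are weighted by similarity times cluster size. The names `add_document`, `build_clusters`, `classify`, and the `threshold` parameter are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def add_document(clusters, vec, label, threshold):
    """Absorb one labeled document into the model (incremental update).

    Assumed constraint: a document may only join a cluster of its own class.
    """
    best, best_sim = None, -1.0
    for c in clusters:
        if c["label"] != label:
            continue
        # Cosine is scale invariant, so the summed vector stands in for the centroid.
        sim = cosine(vec, c["sum"])
        if sim > best_sim:
            best, best_sim = c, sim
    if best is not None and best_sim >= threshold:
        best["sum"] = [s + x for s, x in zip(best["sum"], vec)]
        best["n"] += 1
    else:
        clusters.append({"sum": list(vec), "n": 1, "label": label})


def build_clusters(docs, labels, threshold):
    """Single pass over the training set; returns the list of clusters."""
    clusters = []
    for vec, label in zip(docs, labels):
        add_document(clusters, vec, label, threshold)
    return clusters


def classify(clusters, vec, k):
    """KNN over clusters instead of over individual training documents."""
    scored = sorted(
        ((cosine(vec, c["sum"]), c) for c in clusters),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, c in scored:
        # Assumed voting rule: similarity weighted by cluster size.
        votes[c["label"]] += sim * c["n"]
    return max(votes, key=votes.get)


# Toy term-weight vectors (made up for illustration).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
clusters = build_clusters(docs, labels, threshold=0.8)
prediction = classify(clusters, [1.0, 0.2], k=1)
```

In this sketch a query is compared against one vector per cluster instead of one per training document, which is where the reduction in similarity computations would come from, and `add_document` can be called on new labeled documents without rebuilding the model.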
