A Fast KNN Algorithm for Text Categorization

The KNN algorithm applied to text categorization is a simple, valid and non-parameter method. The traditional KNN has a fatal defect that the time of similarity computing is huge. The practicality will be lost when the KNN algorithm is applied to text categorization with the high dimension and huge samples. In this paper, a method called TFKNN(Tree-Fast-K-Nearest-Neighbor) is presented, which can search the exact k nearest neighbors quickly. In the method, a SSR tree for searching K nearest neighbors is created, in which all child nodes of each non-leaf node are ranked according to the distances between their central points and the central point of their parent. Then the searching scope is reduced based on the tree. Subsequently , the time of similarity computing is decreased largely.

[1]  Ramesh C. Jain,et al.  Similarity indexing with the SS-tree , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[2]  Martin L. Kersten,et al.  Efficient k-NN search on vertically decomposed data , 2002, SIGMOD '02.

[3]  Xiaoming Zhu,et al.  An efficient indexing method for nearest neighbor searches in high-dirnensional image databases , 2002, IEEE Trans. Multim..

[4]  Weiming Shen,et al.  eMarketplaces for enterprise and cross enterprise integration , 2005, Data Knowl. Eng..

[5]  Beng Chin Ooi,et al.  Indexing the Distance: An Efficient Method to KNN Processing , 2001, VLDB.

[6]  Yu Wang,et al.  Text categorization rule extraction based on fuzzy decision tree , 2005, 2005 International Conference on Machine Learning and Cybernetics.

[7]  Zaher Al Aghbari,et al.  Array-index: a plug&search K nearest neighbors method for high-dimensional data , 2005, Data Knowl. Eng..

[8]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.