Fast text classification: a training-corpus pruning based approach

With the rapid growth of online information, text classification is becoming increasingly important. kNN is a widely used, high-performance text classification method. However, it is inefficient because it must compute the similarity between a test document and every training document. In this paper, we propose a fast kNN text classification approach based on pruning the training corpus. With this approach, the training corpus can be condensed sharply, so the time spent on kNN searching is reduced significantly and classification efficiency improves substantially, while classification performance remains comparable to that without pruning. An effective algorithm for text-corpus pruning is designed. Experiments on the Reuters corpus validate the practicability of the proposed approach. The approach is especially suitable for online text classification applications.
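
The abstract does not spell out the pruning criterion, so the sketch below uses a classic condensed-nearest-neighbour heuristic (Hart's rule) purely to illustrate the general idea of shrinking the kNN search set before classification; the TF-IDF/cosine setup and all function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch: kNN text classification over TF-IDF vectors, with a
# condensed-nearest-neighbour style pruning pass that discards training
# documents the retained set already classifies correctly. Illustrative only;
# the paper's own pruning algorithm may differ.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Majority vote over the k training documents most cosine-similar to the query."""
    sims = cosine_similarity(query_vec, train_vecs).ravel()
    top_k = np.argsort(sims)[::-1][:k]
    votes = [train_labels[i] for i in top_k]
    return max(set(votes), key=votes.count)


def condense_corpus(train_vecs, train_labels, k=1):
    """Hart-style condensing: seed with one document per class, then repeatedly
    add any document the current condensed set misclassifies. Documents never
    added are treated as redundant and pruned from the kNN search set."""
    labels = np.asarray(train_labels)
    kept = [int(np.where(labels == c)[0][0]) for c in np.unique(labels)]
    changed = True
    while changed:
        changed = False
        for i in range(train_vecs.shape[0]):
            if i in kept:
                continue
            pred = knn_predict(train_vecs[i], train_vecs[kept],
                               labels[kept].tolist(), k)
            if pred != labels[i]:
                kept.append(i)
                changed = True
    return sorted(kept)


if __name__ == "__main__":
    docs = [
        "stock market falls on weak earnings",
        "stock prices rise after earnings report",
        "market rallies as earnings beat forecasts",
        "team wins the championship game",
        "striker scores the winning goal",
        "coach praises team after the game",
    ]
    labels = ["finance", "finance", "finance", "sports", "sports", "sports"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    kept = condense_corpus(X, labels, k=1)   # indices of retained documents
    X_pruned, labels_pruned = X[kept], [labels[i] for i in kept]

    query = vectorizer.transform(["earnings report lifts the stock market"])
    print(knn_predict(query, X_pruned, labels_pruned, k=1))
```

In this toy run, the test document is compared only against the condensed set rather than the full corpus, which is the efficiency gain the abstract describes.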
