SRFW: a simple, fast and effective text classification algorithm

Text classification is a powerful technique for automating the assignment of documents to topic hierarchies. Although many text classification algorithms exist, most of them are either inefficient or too complex. We present a linear text classification algorithm called SRFW, which is fast, effective and easy to use. SRFW first obtains relevance factors. For each new unlabelled document, it then sums weights based on these relevance factors to estimate the probability that the document belongs to each category, and assigns the document to the category with the highest probability. We evaluated the algorithm on subsets of the Reuters-21578 and 20-newsgroups text collections and compared it against k nearest neighbor (k-NN) and support vector machines (SVM). Experimental results show that SRFW is competitive with both k-NN and SVM in effectiveness while being much simpler and faster.
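
The classification step described above can be illustrated with a minimal Python sketch. The abstract does not say how relevance factors are computed, so the definition used here (the share of a term's training occurrences that fall in each category) is purely an assumption, as are the function names `train_relevance_factors` and `classify` and the toy data. It shows the general shape of a sum-of-weights classifier, not the authors' exact method.

```python
from collections import Counter, defaultdict


def train_relevance_factors(labelled_docs):
    """Build {term: {category: factor}} from (tokens, category) pairs.

    ASSUMPTION: the factor of a term for a category is the fraction of the
    term's training occurrences found in that category. The abstract does
    not give SRFW's actual definition.
    """
    counts = defaultdict(Counter)  # term -> category -> occurrence count
    for tokens, category in labelled_docs:
        for term in tokens:
            counts[term][category] += 1

    factors = {}
    for term, per_category in counts.items():
        total = sum(per_category.values())
        factors[term] = {c: n / total for c, n in per_category.items()}
    return factors


def classify(tokens, factors):
    """Sum the per-category factors of a document's terms, normalise the
    sums so they add up to 1, and return (best_category, probabilities)."""
    scores = defaultdict(float)
    for term in tokens:
        # Terms never seen in training contribute nothing.
        for category, factor in factors.get(term, {}).items():
            scores[category] += factor

    total = sum(scores.values())
    if total == 0:
        return None, {}  # no known term, so no decision is made

    probabilities = {c: s / total for c, s in scores.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


# Hypothetical toy training set, for illustration only.
training = [
    (["wheat", "grain", "export"], "grain"),
    (["oil", "barrel", "export"], "crude"),
]
factors = train_relevance_factors(training)
label, probs = classify(["wheat", "export"], factors)
print(label, probs)
```

Both training and classification are a single pass over the tokens, which is in keeping with the abstract's claim that a linear method of this kind is simple and fast.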
