A Linear Text Classification Algorithm Based on Category Relevance Factors

In this paper, we present a linear text classification algorithm called CRF. Using category relevance factors, CRF computes the feature vectors of the training documents that belong to the same category. From these feature vectors, CRF induces a profile vector for each category. For a new unlabelled document, CRF uses a modified cosine measure to compute the similarity between the document and each category, and assigns the document to the category with the highest similarity score. Because only the profile vectors, rather than the vectors of all training documents, take part in the similarity computation, each new document is compared against one vector per category. We evaluated our algorithm on a subset of the Reuters-21578 and 20_newsgroups text collections and compared it against k-NN and SVM. Experimental results show that CRF outperforms k-NN and is competitive with SVM.
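
The pipeline described above (term weights per category, one profile vector per category, cosine-style scoring of new documents) can be illustrated with a minimal Python sketch. It is not the CRF algorithm itself: the abstract gives neither the definition of the category relevance factor nor the modification to the cosine measure, so the sketch substitutes a smoothed log ratio of in-category to out-of-category document rates for the weight and uses the plain cosine for scoring. The function names `train_profiles`, `cosine`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_profiles(docs, labels):
    """Build one sparse profile vector per category from tokenized documents."""
    categories = sorted(set(labels))
    docs_per_cat = Counter(labels)
    n_total = len(docs)

    # doc_freq[c][t] = number of documents in category c that contain term t
    doc_freq = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        doc_freq[label].update(set(tokens))

    total_freq = Counter()
    for c in categories:
        total_freq.update(doc_freq[c])

    profiles = {}
    for c in categories:
        n_in = docs_per_cat[c]
        n_out = n_total - n_in
        profile = {}
        for term, in_count in doc_freq[c].items():
            out_count = total_freq[term] - in_count
            # ASSUMPTION: stand-in relevance weight, not the paper's formula.
            # Smoothed log ratio of in-category to out-of-category rates.
            in_rate = (in_count + 1) / (n_in + 2)
            out_rate = (out_count + 1) / (n_out + 2)
            weight = math.log(in_rate / out_rate)
            if weight > 0:  # keep only terms that favor this category
                profile[term] = weight
        profiles[c] = profile
    return profiles


def cosine(vec_a, vec_b):
    """Plain cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(tokens, profiles):
    """Score a document against every profile and return the best category."""
    doc_vec = dict(Counter(tokens))
    scores = {c: cosine(doc_vec, p) for c, p in profiles.items()}
    return max(scores, key=scores.get), scores
```

With a toy training set of two "grain" documents and two "finance" documents, `classify(["wheat", "harvest", "price"], profiles)` compares the document against only two profile vectors, one per category, which is the property that separates a profile-based classifier from k-NN, where every training document is consulted.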
