Using the feature projection technique based on a normalized voting method for text classification

This paper proposes a new approach to text categorization based on a feature projection technique. In our approach, training data are represented as the projections of training documents onto each feature. Voting for a classification is carried out on the basis of these individual feature projections, and the final classification of a test document is determined by a majority vote over the individual classifications of each feature. Our empirical results show that the proposed approach, text categorization using feature projections (TCFP), outperforms k-NN, Rocchio, and Naive Bayes. Above all, TCFP is a fast classifier, up to one hundred times faster than k-NN on the Newsgroups data set. It is also robust to noisy data. Because the TCFP algorithm is very simple, it is easy to implement and train. For these reasons, TCFP can be a useful classifier for text categorization tasks that require fast execution, robustness, and high performance.
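The idea of voting over per-feature projections can be illustrated with a minimal sketch. This is not the TCFP algorithm itself: the abstract does not specify the term weighting or the normalization used in voting, so the raw term-frequency weights, the per-feature normalization (each feature's votes sum to 1), and the names `train` and `classify` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(docs):
    """Build feature projections from (tokens, label) pairs.

    Each feature (term) maps to a Counter holding, per label, the summed
    term frequency of the training documents that contain the term.
    Assumption: raw term frequency stands in for the paper's term weight.
    """
    projections = defaultdict(Counter)
    for tokens, label in docs:
        for term, tf in Counter(tokens).items():
            projections[term][label] += tf
    return projections


def classify(projections, tokens):
    """Let every feature of the test document vote, then pick the top label.

    Assumption: a feature's votes are normalized to sum to 1 and scaled by
    the term's frequency in the test document. Returns None when no term of
    the test document was seen in training.
    """
    votes = Counter()
    for term, tf in Counter(tokens).items():
        per_label = projections.get(term)
        if not per_label:
            continue  # unseen feature: it casts no vote
        total = sum(per_label.values())
        for label, weight in per_label.items():
            votes[label] += tf * weight / total
    if not votes:
        return None
    return max(votes, key=votes.get)


# Toy, hand-made training data (hypothetical, not from the paper).
training = [
    ("goal match team score".split(), "sports"),
    ("team coach season win".split(), "sports"),
    ("stock market price trade".split(), "finance"),
    ("bank market rate price".split(), "finance"),
]
model = train(training)

print(classify(model, "team win match".split()))     # sports
print(classify(model, "team market price".split()))  # finance
```

Classification only touches the projections of the terms that occur in the test document, rather than comparing the test document against every training document as k-NN does, which is consistent with the speed advantage reported above.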
