Information-theoretic feature selection algorithms for text classification

A major characteristic of the text document classification problem is the extremely high dimensionality of text data. In this paper, we present four new algorithms for feature (word) selection for the purpose of text classification. We use sequential forward selection methods based on improved mutual information criterion functions. We discuss the performance of the proposed evaluation functions compared with information gain, which evaluates features individually. We present experimental results on the Reuters data set using a naive Bayes classifier based on the multinomial model, a linear support vector machine, and a k-nearest neighbor classifier. Finally, we analyze the experimental results from various perspectives, including precision, recall and the F1-measure. Preliminary experimental results indicate the effectiveness of the proposed feature selection algorithms in text classification.
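The abstract does not spell out the four proposed criterion functions, so the following is only a minimal sketch of the general idea: greedy sequential forward selection driven by a mutual-information criterion that penalizes redundancy with already selected words, in the style of Battiti's MIFS criterion. The function names, the `beta` weight, and the toy word/label data are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def forward_select(columns, labels, k, beta=0.5):
    """Greedy forward selection with a MIFS-style criterion (illustrative).

    columns: dict mapping a word to its per-document discrete values
    labels:  per-document class labels
    beta:    hypothetical weight of the redundancy penalty
    """
    # Relevance of each word on its own; this is what an individual
    # ranking such as information gain would use.
    relevance = {f: mutual_information(col, labels) for f, col in columns.items()}
    selected = []
    remaining = set(columns)
    while remaining and len(selected) < k:
        def score(f):
            # Penalize words that share information with words already chosen.
            redundancy = sum(
                mutual_information(columns[f], columns[s]) for s in selected
            )
            return relevance[f] - beta * redundancy

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): 8 documents, binary word presence, binary class.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "wheat": [1, 1, 1, 0, 0, 0, 0, 0],
    "grain": [1, 1, 1, 0, 0, 0, 0, 0],  # exact duplicate of "wheat"
    "oil":   [0, 0, 0, 1, 0, 0, 0, 0],
    "the":   [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the class
}

individual = forward_select(columns, labels, k=2, beta=0.0)
with_penalty = forward_select(columns, labels, k=2, beta=1.0)
print(individual, with_penalty)
```

With `beta=0.0` the procedure reduces to ranking words individually, so the duplicate pair is selected together; with a positive `beta` the second pick moves to a word that adds information not already covered by the first.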
