Feature selection with a measure of deviations from Poisson in text categorization

To improve the performance of automatic text classification, it is desirable to reduce the high dimensionality of the feature space. In this paper, we propose a new measure for feature selection that estimates term importance from how strongly the probability distribution of each term deviates from the standard Poisson distribution. In the information retrieval literature, deviation from Poisson has been used as a measure for weighting keywords, and this motivates us to adopt it as a feature selection measure for text classification tasks. The proposed measure is constructed to have the same computational complexity as other standard feature selection measures. To test the effectiveness of our method, we conducted evaluation experiments on the Reuters-21578 corpus with support vector machine and k-NN classifiers. In the experiments, we performed binary classification to determine whether or not each test document belongs to a given target category. Each of the top 10 categories of Reuters-21578 was used as a target category, because these categories provide sufficient numbers of training and test documents. Four measures were compared for feature selection: information gain (IG), the χ²-statistic, the Gini index, and the measure proposed in this work. Both the proposed measure and the Gini index proved better than IG and the χ²-statistic in terms of macro-averaged and micro-averaged F1 values, especially at higher vocabulary reduction levels.
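The abstract does not reproduce the paper's exact formula, but the underlying idea can be illustrated with residual IDF, a well-known deviation-from-Poisson score from the information retrieval literature (Church and Gale): under a Poisson model, a term with collection frequency cf scattered over N documents should appear in about N·(1 − e^(−cf/N)) of them, and content-bearing ("bursty") terms fall well short of that expectation. The sketch below is an assumption-laden illustration of this general scoring style, not the paper's proposed measure.

```python
import math

def residual_idf(doc_freq: int, coll_freq: int, n_docs: int) -> float:
    """Deviation-from-Poisson score for a term (illustrative sketch,
    not the paper's exact measure).

    doc_freq  -- number of documents the term occurs in (observed df)
    coll_freq -- total occurrences of the term in the collection (cf)
    n_docs    -- number of documents in the collection (N)
    """
    # Poisson rate per document.
    lam = coll_freq / n_docs
    # Document frequency expected if occurrences were Poisson-distributed.
    expected_df = n_docs * (1.0 - math.exp(-lam))
    # Observed IDF minus Poisson-predicted IDF. Bursty, topical terms
    # concentrate in few documents (doc_freq < expected_df), giving a
    # positive score; near-Poisson function words score close to zero.
    return math.log2(expected_df / doc_freq)

# A bursty term: 100 occurrences packed into only 20 of 1000 documents.
bursty = residual_idf(doc_freq=20, coll_freq=100, n_docs=1000)
# A Poisson-like term: 100 occurrences spread over 95 of 1000 documents
# (the Poisson expectation is about 95.2).
uniform = residual_idf(doc_freq=95, coll_freq=100, n_docs=1000)
```

Ranking all vocabulary terms by such a score and keeping the top k is what makes the approach directly comparable in cost to IG, χ², and the Gini index: each needs only per-term frequency statistics gathered in a single pass over the corpus.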
