A Bayesian feature selection paradigm for text classification

The automated classification of texts into predefined categories has attracted growing interest, driven by the increased availability of documents in digital form and the ensuing need to organize them. An important problem in text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Because of category imbalance and feature sparsity in social text collections, filter methods may work poorly. In this paper, we perform feature selection during training, automatically selecting the best feature subset by learning the characteristics of the categories from a set of preclassified documents. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem through a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.
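To make the idea of a binary inclusion/exclusion vector updated by a Metropolis search concrete, the following is a minimal sketch and not the paper's actual model. It assumes binary term-presence features, a Beta(1, 1)-Bernoulli likelihood (class-specific for included features, pooled across classes for excluded ones), and an independent Bernoulli prior on each inclusion indicator. The names `log_beta_bernoulli`, `feature_log_gain`, `metropolis_select`, and the parameter `prior_incl` are hypothetical. Under these simplifying assumptions the posterior factorizes over features, so the stochastic search is purely illustrative.

```python
import math
import random


def log_beta_bernoulli(k, n, a=1.0, b=1.0):
    """Log marginal likelihood of k successes in n Bernoulli trials
    with the success probability integrated out under a Beta(a, b) prior."""
    return (math.lgamma(a + k) + math.lgamma(b + n - k) - math.lgamma(a + b + n)
            - math.lgamma(a) - math.lgamma(b) + math.lgamma(a + b))


def feature_log_gain(column, labels):
    """Log Bayes factor of 'included' (one distribution per category)
    versus 'excluded' (one distribution shared by all categories)."""
    per_class = 0.0
    for c in set(labels):
        vals = [x for x, y in zip(column, labels) if y == c]
        per_class += log_beta_bernoulli(sum(vals), len(vals))
    pooled = log_beta_bernoulli(sum(column), len(column))
    return per_class - pooled


def metropolis_select(X, labels, n_iter=5000, prior_incl=0.2, seed=0):
    """Metropolis search over the binary inclusion vector gamma.
    Returns the fraction of iterations in which each feature was included."""
    rng = random.Random(seed)
    p = len(X[0])
    gains = [feature_log_gain([row[j] for row in X], labels) for j in range(p)]
    log_prior_odds = math.log(prior_incl / (1.0 - prior_incl))
    gamma = [0] * p            # start with every feature excluded
    included = [0] * p
    for _ in range(n_iter):
        j = rng.randrange(p)   # symmetric proposal: flip one random indicator
        delta = gains[j] + log_prior_odds   # log posterior ratio for 0 -> 1
        if gamma[j] == 1:
            delta = -delta                  # log posterior ratio for 1 -> 0
        # Accept with probability min(1, exp(delta)).
        if delta >= 0 or rng.random() < math.exp(delta):
            gamma[j] = 1 - gamma[j]
        for i in range(p):
            included[i] += gamma[i]
    return [count / n_iter for count in included]
```

Called on a matrix of 0/1 term indicators and a list of category labels, `metropolis_select` returns one estimated inclusion frequency per feature; features whose presence differs strongly between categories should receive frequencies close to 1.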
