Fast Binary Feature Selection with Conditional Mutual Information

We propose a very fast feature selection technique based on conditional mutual information. By picking features that maximize their mutual information with the class to predict, conditional on any feature already picked, it ensures the selection of features that are both individually informative and pairwise weakly dependent. We show that this feature selection method outperforms other classical algorithms, and that a naive Bayesian classifier built with features selected this way achieves error rates similar to those of state-of-the-art methods such as boosting or SVMs. The implementation we propose selects 50 features among 40,000, based on a training set of 500 examples, in a tenth of a second on a standard 1 GHz PC.
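
The selection rule described above can be illustrated with a minimal sketch. It assumes one reading of the criterion: each candidate is scored by the smallest conditional mutual information it shares with the class given any single feature already picked, and the candidate with the largest score is selected. The function names (`entropy`, `mutual_information`, `conditional_mutual_information`, `select_features`) and the toy data are hypothetical, and this naive loop is not the fast implementation whose timings are quoted in the abstract.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Empirical joint entropy, in bits, of one or more equal-length columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_information(y, x):
    """I(Y; X) = H(Y) + H(X) - H(Y, X)."""
    return entropy(y) + entropy(x) - entropy(y, x)


def conditional_mutual_information(y, x, z):
    """I(Y; X | Z) = H(Y, Z) + H(X, Z) - H(Z) - H(Y, X, Z)."""
    return entropy(y, z) + entropy(x, z) - entropy(z) - entropy(y, x, z)


def select_features(features, y, k):
    """Greedy selection (assumed reading of the criterion): a candidate's score
    is its mutual information with y, lowered to the smallest conditional
    mutual information given any single feature picked so far."""
    scores = [mutual_information(y, x) for x in features]
    selected = []
    for _ in range(min(k, len(features))):
        candidates = [i for i in range(len(features)) if i not in selected]
        best = max(candidates, key=scores.__getitem__)
        selected.append(best)
        # Update remaining scores against the feature that was just picked.
        for i in candidates:
            if i != best:
                cmi = conditional_mutual_information(y, features[i], features[best])
                scores[i] = min(scores[i], cmi)
    return selected


# Toy data (hypothetical): f1 is a near-duplicate of f0, while f2 is weakly
# informative on its own but carries information that f0 lacks.
y = [0, 0, 0, 1, 1, 1, 1, 1]
f0 = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 0, 1, 1, 1, 0]
f2 = [0, 0, 1, 1, 0, 0, 1, 1]
picked = select_features([f0, f1, f2], y, 2)
print(picked)
```

On this toy data, ranking by individual mutual information alone would pick `f0` and then `f1`; the conditional criterion instead picks `f0` and then `f2`, because `f1` adds almost nothing once `f0` is known.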
