A Stochastic Algorithm for Feature Selection in Pattern Recognition

We introduce a new model for feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which concentrates its mass on the most efficient, or most informative, components. We implement a stochastic gradient descent algorithm, using P as the state variable and optimizing a multi-task goodness-of-fit criterion for classifiers based on variables randomly sampled according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems, including face detection, handwritten digit recognition, spam classification, and microarray analysis. We then compare our approach with other feature selection algorithms, such as random forests and recursive feature elimination.
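The core loop can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: the multiplicative update, the learning rate `lr`, the subset size `k`, and the user-supplied scoring callable `train_and_score` are all assumptions made for the example, and a faithful version would use the paper's constrained stochastic-approximation scheme with projection onto the simplex.

```python
import numpy as np

def feature_selection_sgd(X, y, train_and_score, n_iter=1000, k=20, lr=0.01, seed=0):
    """Stochastic search for a distribution P over feature indices.

    At each step, sample k features from P, fit a classifier on that
    subset, and shift mass toward features appearing in subsets that
    score well. Hedged sketch: `train_and_score` (returning a
    goodness-of-fit value in [0, 1], e.g. cross-validated accuracy)
    and the exponential reweighting are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    P = np.full(d, 1.0 / d)                        # uniform initial distribution
    for _ in range(n_iter):
        subset = rng.choice(d, size=k, replace=False, p=P)
        score = train_and_score(X[:, subset], y)   # goodness of fit of this subset
        # Reward the sampled features (a REINFORCE-style surrogate for
        # the stochastic gradient with respect to P).
        P[subset] *= np.exp(lr * score)
        P /= P.sum()                               # renormalize onto the simplex
    return P
```

In use, `train_and_score` could return the cross-validated accuracy of any base classifier (for instance a nearest-neighbor rule or an SVM) restricted to the sampled columns; after convergence, the features carrying the most mass in the returned P are the selected ones, and classifiers can be generated by sampling variables from P as described above.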
