The Fisher-Markov Selector: Fast Selecting Maximally Separable Feature Subset for Multiclass Classification with Applications to High-Dimensional Data

Selecting features for multiclass classification is a critically important task in pattern recognition and machine learning applications. Especially challenging is selecting an optimal subset of features from high-dimensional data, which typically have many more variables than observations and contain significant noise, missing components, or outliers. Existing methods either cannot handle high-dimensional data efficiently or scalably, or can obtain only a local optimum rather than the global optimum. To select the globally optimal subset of features efficiently, we introduce a new selector, which we call the Fisher-Markov selector, to identify the features that are most useful in describing essential differences among the possible groups. In particular, we present a way to represent essential discriminating characteristics together with sparsity as an optimization objective. With properly identified measures of sparseness and discriminativeness in possibly high-dimensional settings, we take a systematic approach to optimizing these measures to choose the best feature subset. We use Markov random field optimization techniques to solve the formulated objective functions for simultaneous feature selection. Our results are noncombinatorial, and they can achieve the exact global optimum of the objective function for some special kernels. The method is fast; in particular, it can be linear in the number of features and quadratic in the number of observations. We apply our procedure to a variety of real-world data, including a mid-dimensional optical handwritten digit data set and high-dimensional microarray gene expression data sets. The effectiveness of our method is confirmed by experimental results.
In pattern recognition and from a model selection viewpoint, our procedure shows that it is possible to select the most discriminating subset of variables by optimizing a very simple unconstrained objective function whose solution can in fact be written as an explicit expression.
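To illustrate how an unconstrained selection objective can have an explicit, noncombinatorial solution, the following is a minimal sketch and not the selector defined in the paper. It assumes a generic per-feature Fisher-type score (between-class scatter over within-class scatter) and a hypothetical sparsity penalty `gamma`. Under those assumptions, maximizing the sum of `alpha_j * (score_j - gamma)` over binary indicators `alpha_j` decouples across features, so the global optimum is simply to keep every feature whose score exceeds `gamma`, at a cost linear in the number of features.

```python
import numpy as np

def per_feature_scores(X, y):
    """Between-class over within-class scatter, computed per feature.

    This generic Fisher-type score is an assumption made for illustration;
    it is not claimed to be the statistic used by the Fisher-Markov selector.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - overall_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    # Small constant guards against division by zero for constant features.
    return between / (within + 1e-12)

def select_features(X, y, gamma):
    """Maximize sum_j alpha_j * (score_j - gamma) over alpha in {0, 1}^p.

    The objective is separable, so the exact maximizer keeps feature j
    if and only if score_j > gamma (gamma is a hypothetical penalty).
    """
    scores = per_feature_scores(X, y)
    return np.flatnonzero(scores > gamma)

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = np.array([[0.0, 1.0], [0.1, 2.0], [5.0, 1.0], [5.1, 2.0]])
y = np.array([0, 0, 1, 1])
selected = select_features(X, y, gamma=1.0)
```

Raising `gamma` in this sketch yields a sparser subset, which mirrors the trade-off between sparsity and discriminativeness described above.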
