Feature selection, mutual information, and the classification of high-dimensional patterns

We propose a novel feature-selection filter for supervised learning that relies on the efficient estimation of the mutual information between a high-dimensional set of features and the class labels. We bypass the estimation of the probability density function by means of the entropic-graph approximation of the Rényi entropy and its subsequent extrapolation to the Shannon entropy. Thus, the complexity depends on the number of patterns (samples) rather than on the number of dimensions, and the curse of dimensionality is circumvented. We show that it is then possible to outperform algorithms that rank features individually, as well as a greedy algorithm based on the maximal-relevance minimal-redundancy (mRMR) criterion. We test our method successfully in both image classification and microarray data classification. For most of the tested data sets, we obtain better classification results than those reported in the literature.
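
To make the pipeline concrete, here is a minimal Python sketch of the entropic-graph route to mutual information. It is not the authors' exact implementation: the Rényi entropy is estimated from the weighted length of a Euclidean minimum spanning tree over the samples, the Shannon entropy is approximated by a single Rényi order α close to 1 (a simplification of the extrapolation step), the estimator's bias constant β is dropped because it cancels when feature subsets are compared, and all function names are ours.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree


def renyi_entropy_mst(X, alpha=0.99):
    """MST-based Renyi entropy estimate for samples X of shape (n, d).

    Sketch of the entropic-spanning-graph estimator: the weighted length
    of the Euclidean MST gives H_alpha up to an additive constant
    log(beta), which we drop here since it does not depend on the data
    and cancels when feature subsets are compared."""
    n, d = X.shape
    gamma = d * (1.0 - alpha)              # edge-weight exponent
    dists = squareform(pdist(X))           # n x n Euclidean distances
    mst = minimum_spanning_tree(dists)     # sparse matrix of MST edges
    length = np.sum(mst.data ** gamma)     # weighted MST length L_gamma
    return (np.log(length) - alpha * np.log(n)) / (1.0 - alpha)


def shannon_entropy(X, alpha=0.99):
    """Approximate Shannon entropy as the Renyi entropy with alpha -> 1."""
    return renyi_entropy_mst(X, alpha)


def mutual_information(X, y, alpha=0.99):
    """I(X; C) = H(X) - sum_c p(c) * H(X | C = c), with every entropy
    estimated from an MST: no density estimate is ever built, and the
    cost grows with the number of samples, not the dimensionality.
    Assumes each class contributes at least two samples, so that every
    per-class MST has edges."""
    h_x = shannon_entropy(X, alpha)
    labels, counts = np.unique(y, return_counts=True)
    h_x_given_c = sum(
        (n_c / len(y)) * shannon_entropy(X[y == c], alpha)
        for c, n_c in zip(labels, counts)
    )
    return h_x - h_x_given_c
```

Used as a filter, a call such as mutual_information(X[:, subset], y) scores an entire candidate feature subset at once, so a greedy forward search can grow the subset by adding, at each step, the feature that most increases the score; this is what lets the method go beyond individual feature rankings.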
