A Performance Evaluation of Mutual Information Estimators for Multivariate Feature Selection

Mutual information is one of the most popular criteria used in feature selection, and many techniques have been proposed to estimate it. The large majority of these techniques are based on probability density estimation and perform poorly on high-dimensional data because of the curse of dimensionality. However, robustly evaluating the mutual information between a subset of features and an output vector is of great interest in feature selection. This is particularly the case when some features are relevant or redundant only jointly. In this paper, different mutual information estimators are compared according to criteria that are important for feature selection, and the merits of a nearest-neighbor-based estimator are demonstrated.
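The nearest-neighbor estimator highlighted above avoids explicit density estimation by working directly with distances between samples. Below is a minimal sketch of this idea, following the Kraskov-Stogbauer-Grassberger (KSG) estimator (their algorithm 1) with the Chebyshev norm; the function name ksg_mutual_information, the default k = 3, and the use of scipy's cKDTree are illustrative implementation choices, not prescribed by the paper.

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import digamma

    def ksg_mutual_information(x, y, k=3):
        """Estimate I(X; Y) in nats with a k-nearest-neighbor estimator
        in the style of Kraskov et al. (algorithm 1).

        x : (n, d_x) array of feature samples
        y : (n, d_y) array of output samples
        k : neighbor count; smaller k lowers bias at the cost of variance
        """
        n = x.shape[0]
        xy = np.hstack([x, y])

        # Distance from each point to its k-th nearest neighbor in the
        # joint space, under the Chebyshev (max) norm.  k + 1 neighbors
        # are queried because each point is its own nearest neighbor.
        joint_tree = cKDTree(xy)
        eps = joint_tree.query(xy, k=k + 1, p=np.inf)[0][:, -1]

        # Shrink each radius by one ulp so the marginal counts below are
        # strict (distance < eps), as the estimator requires.
        radii = np.nextafter(eps, 0)

        # n_x(i) and n_y(i): points within eps(i) of point i in each
        # marginal space, excluding the point itself.
        x_tree, y_tree = cKDTree(x), cKDTree(y)
        nx = np.array([len(x_tree.query_ball_point(pt, r, p=np.inf)) - 1
                       for pt, r in zip(x, radii)])
        ny = np.array([len(y_tree.query_ball_point(pt, r, p=np.inf)) - 1
                       for pt, r in zip(y, radii)])

        mi = digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
        return max(0.0, mi)  # MI is nonnegative; clip small negative noise

As a quick sanity check, on n i.i.d. samples from a bivariate Gaussian with correlation rho, the estimate should approach the closed-form value -0.5 * ln(1 - rho^2) as n grows.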
