Feature Selection via Mutual Information: New Theoretical Insights

Mutual information has been successfully adopted in filter feature-selection methods to assess both the relevance of a subset of features in predicting the target variable and its redundancy with respect to the other variables. However, existing algorithms are mostly heuristic and do not offer any guarantee on the proposed solution. In this paper, we provide novel theoretical results showing that conditional mutual information naturally arises when bounding the ideal regression/classification errors achieved by different subsets of features. Leveraging these insights, we propose a novel stopping condition for backward and forward greedy methods which ensures that the ideal prediction error using the selected feature subset remains bounded by a user-specified threshold. We provide numerical simulations to support our theoretical claims and to compare the proposed approach with common heuristic methods.
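
To make the forward variant of such a greedy scheme concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: it assumes discrete features, uses a simple plug-in estimate of conditional mutual information, and stops once no remaining feature adds more than a user-chosen threshold of information about the target. The function names and the parameter `eps` are hypothetical.

```python
import numpy as np
from collections import Counter

def conditional_mutual_information(x, y, z_cols):
    """Plug-in estimate of I(X; Y | Z) in nats, for discrete data.

    x, y   : 1-D arrays of discrete labels.
    z_cols : 2-D array whose columns are the conditioning features
             (zero columns reduces this to the plain MI I(X; Y)).
    """
    n = len(x)
    z_keys = [tuple(row) for row in z_cols]  # each row of Z as a hashable key
    joint = Counter(zip(x, y, z_keys))
    xz = Counter(zip(x, z_keys))
    yz = Counter(zip(y, z_keys))
    z = Counter(z_keys)
    cmi = 0.0
    for (xi, yi, zi), c in joint.items():
        # p(x,y,z) * log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]; the 1/n factors cancel
        cmi += (c / n) * np.log((c * z[zi]) / (xz[(xi, zi)] * yz[(yi, zi)]))
    return cmi

def forward_greedy_cmi(X, y, eps):
    """Forward greedy selection with a CMI-based stopping condition.

    At each step, add the feature with the largest estimated CMI with the
    target given the current subset; stop when no remaining feature's CMI
    exceeds eps (hypothetical threshold standing in for the paper's
    user-specified error bound).
    """
    n, d = X.shape
    selected, remaining = [], list(range(d))
    while remaining:
        scores = [conditional_mutual_information(X[:, j], y, X[:, selected])
                  for j in remaining]
        best = int(np.argmax(scores))
        if scores[best] < eps:  # residual information below the threshold: stop
            break
        selected.append(remaining.pop(best))
    return selected
```

For instance, `forward_greedy_cmi(X, y, eps=0.01)` would keep adding features until none of the remaining ones carries more than 0.01 nats of additional information about y given the features already selected, mirroring the abstract's idea of bounding the residual prediction error by a user-specified threshold.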
