Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection

We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this, we adopt a strategy different from the usual one in the feature selection literature: instead of trying to define a criterion, we derive one, directly from a clearly specified objective function, the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimise definitions of feature 'relevancy' and 'redundancy', our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result, we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy and redundancy terms. Overall, we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best trade-off in terms of accuracy, stability, and flexibility with small data samples.
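To make the unification concrete, the following is a reconstruction in the paper's notation (the formula is not quoted in the abstract above). Writing S for the set of already-selected features and X_k for a candidate, the surveyed heuristics can be organised as points in a parametrised space of scoring functions

    J(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_j; X_k) + \gamma \sum_{X_j \in S} I(X_j; X_k \mid Y),

where \beta = \gamma = 0 recovers plain mutual-information ranking, \gamma = 0 gives MIFS/mRMR-style relevancy-minus-redundancy criteria, and \beta = \gamma = 1/|S| corresponds to JMI.

The JMI criterion favoured in the conclusion scores a candidate by J_jmi(X_k) = \sum_{X_j \in S} I(X_k X_j; Y), the information that each selected-candidate pair jointly carries about the label. Below is a minimal sketch of the corresponding greedy forward selection, assuming discrete features and simple plug-in estimates; mutual_information and jmi_forward_selection are illustrative names, not code accompanying the paper.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    # Plug-in estimate of I(X;Y) in bits for two discrete sequences.
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum((c / n) * np.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def jmi_forward_selection(X, y, k):
    # Greedy maximiser of J_jmi(X_k) = sum_{X_j in S} I(X_k, X_j; Y),
    # where each term is the MI between a feature *pair* and the label.
    n, d = X.shape
    # Seed with the individually most relevant feature, argmax I(X_k; Y).
    selected = [max(range(d), key=lambda f: mutual_information(X[:, f], y))]
    while len(selected) < k:
        def score(f):
            # Pair candidate f with every already-selected feature.
            return sum(mutual_information(list(zip(X[:, f], X[:, j])), y)
                       for j in selected)
        candidates = [f for f in range(d) if f not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

For an integer-valued (n_samples, n_features) array X, jmi_forward_selection(X, y, 10) returns the indices of ten features; a production implementation would keep the same greedy scheme but use more careful entropy estimation (cf. Paninski, 2003).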

References

[1] Lei Yu, Chris H. Q. Ding, and Steven Loscalzo. Stable feature selection via dense feature groups. KDD, 2008.

[2] Hanchuan Peng, Fuhui Long, and Chris H. Q. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.

[3] Claude E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 1948.

[4] François Fleuret. Fast Binary Feature Selection with Conditional Mutual Information. Journal of Machine Learning Research, 2004.

[5] Gavin Brown. A New Perspective for Information Theoretic Feature Selection. AISTATS, 2009.

[6] Ron Kohavi and George H. John. Wrappers for Feature Subset Selection. Artificial Intelligence, 1997.

[7] Martin E. Hellman and Josef Raviv. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory, 1970.

[8] Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled Hybrids of Generative and Discriminative Models. CVPR, 2006.

[9] David Maxwell Chickering, David Heckerman, and Christopher Meek. Large-Sample Learning of Bayesian Networks is NP-Hard. Journal of Machine Learning Research, 2004.

[10] Hongrong Cheng, Zhiguang Qin, Chaosheng Feng, Yong Wang, and Fagen Li. Conditional Mutual Information-Based Feature Selection Analyzing for Synergy and Redundancy. ETRI Journal, 2011.

[11] Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. ICML, 1996.

[12] Ludmila I. Kuncheva. A stability index for feature selection. Artificial Intelligence and Applications, 2007.

[13] Ali El Akadi, Abdeljalil El Ouardighi, and Driss Aboutajdine. A Powerful Feature Selection Approach Based on Mutual Information. International Journal of Computer Science and Network Security, 2008.

[14] Robert M. Fano. Transmission of Information: A Statistical Theory of Communications. MIT Press, 1961.

[15] A. Thomasian. Review of 'Transmission of Information, A Statistical Theory of Communications' (Fano, R. M.; 1961). 1962.

[16] Isabelle Guyon. Design of experiments for the NIPS 2003 variable selection benchmark. 2003.

[17] Lei Yu and Huan Liu. Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research, 2004.

[18] Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 2006.

[19] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2nd edition, 2006.

[20] Carlos M. Fonseca and Peter J. Fleming. On the Performance Assessment and Comparison of Stochastic Multiobjective Optimizers. PPSN, 1996.

[21] David D. Lewis. Feature Selection and Feature Extraction for Text Categorization. HLT, 1992.

[22] Baofeng Guo and Mark S. Nixon. Gait Feature Subset Selection by Mutual Information. IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS), 2007.

[23] Michel Tesmer and Pablo A. Estévez. AMIFS: adaptive feature selection by using mutual information. IEEE International Joint Conference on Neural Networks (IJCNN), 2004.

[24] Patrick E. Meyer, Colas Schretter, and Gianluca Bontempi. Information-Theoretic Feature Selection in Microarray Data Using Variable Complementarity. IEEE Journal of Selected Topics in Signal Processing, 2008.

[25] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 1994.

[26] Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, and Vladimir Vapnik. Feature Selection for SVMs. NIPS, 2000.

[27] Ioannis Tsamardinos and Constantin F. Aliferis. Towards Principled Feature Selection: Relevancy, Filters and Wrappers. AISTATS, 2003.

[28] Aleks Jakulin. Machine Learning Based on Attribute Interactions. PhD thesis, University of Ljubljana, 2005.

[29] Nojun Kwak and Chong-Ho Choi. Input feature selection for classification problems. IEEE Transactions on Neural Networks, 2002.

[30] Daniel Grossman and Pedro M. Domingos. Learning Bayesian network classifiers by maximizing conditional likelihood. ICML, 2004.

[31] Michel Vidal-Naquet and Shimon Ullman. Object recognition with informative features and linear classification. ICCV, 2003.

[32] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi A. Zadeh, editors. Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing, Springer, 2006.

[33] Ioannis Tsamardinos, Constantin F. Aliferis, and Alexander Statnikov. Algorithms for Large Scale Markov Blanket Discovery. FLAIRS, 2003.

[34] Liam Paninski. Estimation of Entropy and Mutual Information. Neural Computation, 2003.

[35] Adam Pocock, Mikel Luján, and Gavin Brown. Informative Priors for Markov Blanket Discovery. AISTATS, 2012.

[36] Kiran S. Balagani and Vir V. Phoha. On the Feature Selection Criterion Based on an Approximation of Multidimensional Mutual Information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

[37] Alexandros Kalousis, Julien Prados, and Melanie Hilario. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 2007.

[38] Howard Hua Yang and John E. Moody. Data Visualization and Feature Selection: New Algorithms for Nongaussian Data. NIPS, 1999.

[39] Patrick E. Meyer and Gianluca Bontempi. On the Use of Variable Complementarity for Feature Selection in Cancer Classification. EvoWorkshops, 2006.

[40] Jacek P. Dmochowski, Paul Sajda, and Lucas C. Parra. Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds. Journal of Machine Learning Research, 2010.

[41] Gokhan Gulgezen, Zehra Cataltepe, and Lei Yu. Stable and Accurate Feature Selection. ECML/PKDD, 2009.

[42] Thomas P. Minka. Discriminative models, not discriminative training. Microsoft Research Technical Report MSR-TR-2005-144, 2005.

[43] Dahua Lin and Xiaoou Tang. Conditional Infomax Learning: An Integrated Framework for Feature Extraction and Fusion. ECCV, 2006.