Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection

We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this, we adopt a strategy different from the usual one in the feature selection literature: instead of trying to define a criterion, we derive one, directly from a clearly specified objective function, the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimise definitions of feature 'relevancy' and 'redundancy', our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result, we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy and redundancy terms. Overall, we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best trade-off in terms of accuracy, stability, and flexibility with small data samples.
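To make the unification concrete, the following is a reconstruction in the paper's notation (the formula is not quoted in the abstract above). Writing S for the set of already-selected features and X_k for a candidate, the surveyed heuristics can be organised as points in a parametrised space of scoring functions

    J(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_j; X_k) + \gamma \sum_{X_j \in S} I(X_j; X_k \mid Y),

where \beta = \gamma = 0 recovers plain mutual-information ranking, \gamma = 0 gives MIFS/mRMR-style relevancy-minus-redundancy criteria, and \beta = \gamma = 1/|S| corresponds to JMI.

The JMI criterion favoured in the conclusion scores a candidate by J_jmi(X_k) = \sum_{X_j \in S} I(X_k X_j; Y), the information that each selected-candidate pair jointly carries about the label. Below is a minimal sketch of the corresponding greedy forward selection, assuming discrete features and simple plug-in estimates; mutual_information and jmi_forward_selection are illustrative names, not code accompanying the paper.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    # Plug-in estimate of I(X;Y) in bits for two discrete sequences.
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum((c / n) * np.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def jmi_forward_selection(X, y, k):
    # Greedy maximiser of J_jmi(X_k) = sum_{X_j in S} I(X_k, X_j; Y),
    # where each term is the MI between a feature *pair* and the label.
    n, d = X.shape
    # Seed with the individually most relevant feature, argmax I(X_k; Y).
    selected = [max(range(d), key=lambda f: mutual_information(X[:, f], y))]
    while len(selected) < k:
        def score(f):
            # Pair candidate f with every already-selected feature.
            return sum(mutual_information(list(zip(X[:, f], X[:, j])), y)
                       for j in selected)
        candidates = [f for f in range(d) if f not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

For an integer-valued (n_samples, n_features) array X, jmi_forward_selection(X, y, 10) returns the indices of ten features; a production implementation would keep the same greedy scheme but use more careful entropy estimation (cf. Paninski, 2003).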

References

[1] Lei Yu, Chris H. Q. Ding, and Steven Loscalzo. Stable feature selection via dense feature groups. KDD, 2008.

[2] Hanchuan Peng, Fuhui Long, and Chris H. Q. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.

[3] Claude E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 1948.

[4] François Fleuret. Fast Binary Feature Selection with Conditional Mutual Information. Journal of Machine Learning Research, 2004.

[5] Gavin Brown. A New Perspective for Information Theoretic Feature Selection. AISTATS, 2009.

[6] Ron Kohavi and George H. John. Wrappers for Feature Subset Selection. Artificial Intelligence, 1997.

[7] Martin E. Hellman and Josef Raviv. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory, 1970.

[8] Julia A. Lasserre, Christopher M. Bishop, and Thomas P. Minka. Principled Hybrids of Generative and Discriminative Models. CVPR, 2006.

[9] David Maxwell Chickering, David Heckerman, and Christopher Meek. Large-Sample Learning of Bayesian Networks is NP-Hard. Journal of Machine Learning Research, 2004.

[10] Hongrong Cheng, Zhiguang Qin, Chaosheng Feng, Yong Wang, and Fagen Li. Conditional Mutual Information-Based Feature Selection Analyzing for Synergy and Redundancy. ETRI Journal, 2011.

[11] Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. ICML, 1996.

[12] Ludmila I. Kuncheva. A stability index for feature selection. Artificial Intelligence and Applications, 2007.

[13] Ali El Akadi, Abdeljalil El Ouardighi, and Driss Aboutajdine. A Powerful Feature Selection Approach Based on Mutual Information. International Journal of Computer Science and Network Security, 2008.

[14] Robert M. Fano. Transmission of Information: A Statistical Theory of Communications. MIT Press, 1961.

[15] A. Thomasian. Review of 'Transmission of Information, A Statistical Theory of Communications' (Fano, R. M.; 1961). 1962.

[16] Isabelle Guyon. Design of experiments for the NIPS 2003 variable selection benchmark. 2003.

[17] Lei Yu and Huan Liu. Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research, 2004.

[18] Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 2006.

[19] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2nd edition, 2006.

[20] Carlos M. Fonseca and Peter J. Fleming. On the Performance Assessment and Comparison of Stochastic Multiobjective Optimizers. PPSN, 1996.

[21] David D. Lewis. Feature Selection and Feature Extraction for Text Categorization. HLT, 1992.

[22] Baofeng Guo and Mark S. Nixon. Gait Feature Subset Selection by Mutual Information. IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS), 2007.

[23] Michel Tesmer and Pablo A. Estévez. AMIFS: adaptive feature selection by using mutual information. IEEE International Joint Conference on Neural Networks (IJCNN), 2004.

[24] Patrick E. Meyer, Colas Schretter, and Gianluca Bontempi. Information-Theoretic Feature Selection in Microarray Data Using Variable Complementarity. IEEE Journal of Selected Topics in Signal Processing, 2008.

[25] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 1994.

[26] Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, and Vladimir Vapnik. Feature Selection for SVMs. NIPS, 2000.

[27] Ioannis Tsamardinos and Constantin F. Aliferis. Towards Principled Feature Selection: Relevancy, Filters and Wrappers. AISTATS, 2003.

[28] Aleks Jakulin. Machine Learning Based on Attribute Interactions. PhD thesis, University of Ljubljana, 2005.

[29] Nojun Kwak and Chong-Ho Choi. Input feature selection for classification problems. IEEE Transactions on Neural Networks, 2002.

[30] Daniel Grossman and Pedro M. Domingos. Learning Bayesian network classifiers by maximizing conditional likelihood. ICML, 2004.

[31] Michel Vidal-Naquet and Shimon Ullman. Object recognition with informative features and linear classification. ICCV, 2003.

[32] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lotfi A. Zadeh, editors. Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing, Springer, 2006.

[33] Ioannis Tsamardinos, Constantin F. Aliferis, and Alexander Statnikov. Algorithms for Large Scale Markov Blanket Discovery. FLAIRS, 2003.

[34] Liam Paninski. Estimation of Entropy and Mutual Information. Neural Computation, 2003.

[35] Adam Pocock, Mikel Luján, and Gavin Brown. Informative Priors for Markov Blanket Discovery. AISTATS, 2012.

[36] Kiran S. Balagani and Vir V. Phoha. On the Feature Selection Criterion Based on an Approximation of Multidimensional Mutual Information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

[37] Alexandros Kalousis, Julien Prados, and Melanie Hilario. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowledge and Information Systems, 2007.

[38] Howard Hua Yang and John E. Moody. Data Visualization and Feature Selection: New Algorithms for Nongaussian Data. NIPS, 1999.

[39] Patrick E. Meyer and Gianluca Bontempi. On the Use of Variable Complementarity for Feature Selection in Cancer Classification. EvoWorkshops, 2006.

[40] Jacek P. Dmochowski, Paul Sajda, and Lucas C. Parra. Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds. Journal of Machine Learning Research, 2010.

[41] Gokhan Gulgezen, Zehra Cataltepe, and Lei Yu. Stable and Accurate Feature Selection. ECML/PKDD, 2009.

[42] Thomas P. Minka. Discriminative models, not discriminative training. Microsoft Research Technical Report MSR-TR-2005-144, 2005.

[43] Dahua Lin and Xiaoou Tang. Conditional Infomax Learning: An Integrated Framework for Feature Extraction and Fusion. ECCV, 2006.