Feature selection via joint likelihood

We study the nature of filter methods for feature selection. In particular, we examine information theoretic approaches to this problem, reviewing the literature of the past 20 years. We consider this literature from a different perspective, viewing feature selection as a process which minimises a loss function. We choose the model likelihood as the loss function, and thus seek to maximise the likelihood. The first contribution of this thesis is to show that the problem of information theoretic filter feature selection can be rephrased as maximising the likelihood of a discriminative model. From this novel result we unify the literature, revealing that many of these selection criteria are approximate maximisers of the joint likelihood. Many of these heuristic criteria were hand-designed to optimise various definitions of feature "relevancy" and "redundancy"; our probabilistic interpretation naturally includes these concepts, plus the "conditional redundancy", a measure of positive interactions between features. This perspective allows us to derive the different criteria from the joint likelihood by making different independence assumptions on the underlying probability distributions. We provide an empirical study which reinforces our theoretical conclusions, whilst revealing implementation considerations due to the varying magnitudes of the relevancy and redundancy terms.

We then investigate the benefits our probabilistic perspective provides when these feature selection criteria are applied in new areas. The joint likelihood automatically includes a prior distribution over the selected feature sets, so we investigate how including prior knowledge affects the feature selection process. We can now incorporate domain knowledge into feature selection, allowing the imposition of sparsity on the selected feature set without heuristic stopping criteria. We investigate the use of priors mainly in the context of Markov Blanket discovery algorithms, in the process showing that a family of algorithms based upon IAMB are iterative maximisers of our joint likelihood with respect to a particular sparsity prior. We thus extend the IAMB family to include a prior for domain knowledge in addition to the sparsity prior.

Next we investigate what the choice of likelihood function implies about the resulting filter criterion. We do this by applying our derivation to a cost-weighted likelihood, showing that this likelihood implies a particular cost-sensitive filter criterion. This criterion is based on a weighted branch of information theory, and we prove several novel results justifying its use as a feature selection criterion, namely the positivity of the measure and the chain rule of mutual information. We show that the feature set produced by this cost-sensitive filter criterion can be used to convert a cost-insensitive classifier into a cost-sensitive one by adjusting the features the classifier sees. This is analogous to adjusting the data via over- or undersampling to create a cost-sensitive classifier, but with the crucial difference that it does not artificially alter the data distribution.

Finally, we conclude with a summary of the benefits this loss function view of feature selection provides. The perspective can be used to analyse feature selection techniques other than those based upon information theory, and new groups of selection criteria can be derived by considering novel loss functions.
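As a minimal sketch of the criterion structure this unification yields (the notation here is ours, not quoted from the thesis: $Y$ is the class label, $X_k$ a candidate feature, $S$ the set of already-selected features, and $\beta$, $\gamma$ criterion-specific coefficients), the greedy score assigned to a candidate feature under such independence assumptions takes the general form

$$
J(X_k) \;=\; \underbrace{I(X_k; Y)}_{\text{relevancy}}
\;-\; \beta \underbrace{\sum_{X_j \in S} I(X_j; X_k)}_{\text{redundancy}}
\;+\; \gamma \underbrace{\sum_{X_j \in S} I(X_j; X_k \mid Y)}_{\text{conditional redundancy}},
$$

where different published criteria correspond to different choices of $\beta$ and $\gamma$; for instance, setting $\gamma = 0$ discards the conditional redundancy term and so ignores the positive feature interactions described above.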
