Feature Selection via L1-Penalized Squared-Loss Mutual Information

Feature selection is a technique for screening out less important features. Many existing supervised feature selection algorithms rely on relevance and redundancy as the main criteria for selecting features. However, feature interaction, which can be a key characteristic of real-world problems, has received comparatively little attention. To take feature interaction into account, we propose L1-LSMI, an L1-regularization-based algorithm that maximizes a squared-loss variant of mutual information between selected features and outputs. Numerical results show that L1-LSMI performs well in handling redundancy, detecting non-linear dependency, and accounting for feature interaction.
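
To make the criterion concrete, the following is a minimal sketch of the quantities involved, assuming the standard definition of squared-loss mutual information (the Pearson-divergence analogue of ordinary mutual information) and a hypothetical feature-weight vector w; the notation is illustrative and not taken from the paper.

\[
\mathrm{SMI}(X, Y) = \frac{1}{2} \iint \left( \frac{p(x, y)}{p(x)\,p(y)} - 1 \right)^{2} p(x)\, p(y)\, \mathrm{d}x\, \mathrm{d}y ,
\]

which equals zero if and only if X and Y are statistically independent. An L1-penalized feature-selection objective in this spirit scales each feature by a non-negative weight and maximizes an SMI estimator under an L1 budget,

\[
\max_{w \ge 0} \; \widehat{\mathrm{SMI}}\!\left(w \circ x,\; y\right) \quad \text{subject to} \quad \|w\|_{1} \le z ,
\]

where \(\circ\) denotes the element-wise product and \(z > 0\) controls how many weights are driven to zero; features whose weights vanish are discarded.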
