Feature Selection via L1-Penalized Squared-Loss Mutual Information

Feature selection is a technique for screening out less important features. Many existing supervised feature selection algorithms rely on relevance and redundancy as the main criteria for selecting features. However, feature interaction, which can be a key characteristic of real-world problems, has received comparatively little attention. To take feature interaction into account, we propose L1-LSMI, an L1-regularization-based algorithm that maximizes a squared-loss variant of mutual information between selected features and outputs. Numerical results show that L1-LSMI performs well in handling redundancy, detecting non-linear dependency, and accounting for feature interaction.
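
To make the criterion concrete, the following is a minimal sketch of the quantities involved, assuming the standard definition of squared-loss mutual information (the Pearson-divergence analogue of ordinary mutual information) and a hypothetical feature-weight vector w; the notation is illustrative and not taken from the paper.

\[
\mathrm{SMI}(X, Y) = \frac{1}{2} \iint \left( \frac{p(x, y)}{p(x)\,p(y)} - 1 \right)^{2} p(x)\, p(y)\, \mathrm{d}x\, \mathrm{d}y ,
\]

which equals zero if and only if X and Y are statistically independent. An L1-penalized feature-selection objective in this spirit scales each feature by a non-negative weight and maximizes an SMI estimator under an L1 budget,

\[
\max_{w \ge 0} \; \widehat{\mathrm{SMI}}\!\left(w \circ x,\; y\right) \quad \text{subject to} \quad \|w\|_{1} \le z ,
\]

where \(\circ\) denotes the element-wise product and \(z > 0\) controls how many weights are driven to zero; features whose weights vanish are discarded.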
