Structured Variable Selection with Sparsity-Inducing Norms

We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These norms are defined as sums of Euclidean norms over certain subsets of variables, extending the usual ℓ1-norm and the group ℓ1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth between groups and patterns. This makes it possible to design norms adapted to prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low- and high-dimensional settings.
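
To make the definition and the groups-to-patterns direction concrete, here is a minimal sketch in Python. The function names (structured_norm, zero_patterns, nonzero_patterns) and the toy groups are illustrative assumptions, not the paper's code: the sketch evaluates Ω(w) = Σ_{g∈G} ‖w_g‖₂ for possibly overlapping groups, and enumerates the allowed nonzero patterns as complements of unions of groups.

```python
# Illustrative sketch: the structured sparsity-inducing norm and the
# forward map from groups to allowed patterns. Zero patterns are unions
# of groups; allowed nonzero patterns are their complements.
import itertools
import numpy as np

def structured_norm(w, groups):
    """Sum of Euclidean norms over (possibly overlapping) groups."""
    return sum(np.linalg.norm(w[list(g)]) for g in groups)

def zero_patterns(groups):
    """All unions of groups (including the empty union); exponential
    in the number of groups, so for small illustrative examples only."""
    patterns = {frozenset()}
    for r in range(1, len(groups) + 1):
        for subset in itertools.combinations(groups, r):
            patterns.add(frozenset().union(*subset))
    return patterns

def nonzero_patterns(groups, p):
    """Allowed nonzero patterns: complements of unions of groups."""
    full = frozenset(range(p))
    return {full - z for z in zero_patterns(groups)}

# Two overlapping groups on p = 3 variables: the overlap restricts
# which supports a regularized solution can take.
groups = [frozenset({0, 1}), frozenset({1, 2})]
w = np.array([1.0, -2.0, 0.5])
print(structured_norm(w, groups))            # ||w_{0,1}|| + ||w_{1,2}||
print(sorted(map(sorted, nonzero_patterns(groups, 3))))
# -> [[], [0], [0, 1, 2], [2]]  (e.g. {1} alone is not an allowed support)
```

In this toy example, the overlap of the two groups on variable 1 rules out {1} as a support: selecting variable 1 alone would require zeroing out a strict subset of each group, which the norm does not encourage.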
