Structured sparsity-inducing norms through submodular functions

Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with supports of small cardinality. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function with its convex envelope (tightest convex lower bound), in this case the ℓ1-norm. In this paper, we investigate set-functions more general than the cardinality, which may incorporate prior knowledge or structural constraints common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we give new interpretations to known norms, such as those based on rank statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
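
To make the construction concrete, the sketch below evaluates the Lovász extension of a nondecreasing submodular set-function F (with F(∅) = 0) via the standard greedy (Edmonds) algorithm, which also yields a subgradient; the induced norm is then Ω(w) = f(|w|), the convex envelope of w ↦ F(supp(w)) on the unit ℓ∞-ball. This is a minimal illustration under the assumption that F is supplied as a Python callable on index sets; the interface and the example group structure are hypothetical, not taken from the paper.

```python
import numpy as np

def greedy_lovasz(F, w):
    """Return (f(w), s): the Lovasz extension f of F at w >= 0, and the
    greedy vertex s of the base polytope B(F) maximizing <s, w>, which is
    a subgradient of f at w. Assumes F(set()) == 0 and F nondecreasing
    submodular."""
    order = np.argsort(-w)            # coordinates in decreasing order of w
    s = np.zeros(len(w))
    S, F_prev = set(), 0.0
    for j in order:
        S.add(int(j))
        F_cur = F(S)
        s[j] = F_cur - F_prev         # marginal gain of adding coordinate j
        F_prev = F_cur
    return float(np.dot(s, w)), s

def submodular_norm(F, w):
    """Omega(w) = f(|w|): the polyhedral norm induced by F, i.e., the convex
    envelope of w -> F(supp(w)) on the unit ell_infinity ball."""
    value, _ = greedy_lovasz(F, np.abs(w))
    return value

# Sanity check: the cardinality function recovers the ell_1 norm.
w = np.array([0.5, -2.0, 0.0, 1.5])
card = lambda S: float(len(S))
assert np.isclose(submodular_norm(card, w), np.abs(w).sum())

# A set-cover function F(A) = number of groups intersecting A gives a
# structured norm with overlapping groups (groups chosen for illustration).
groups = [{0, 1}, {1, 2}, {2, 3}]
cover = lambda S: float(sum(1 for g in groups if g & S))
print(submodular_norm(cover, w))
```

The greedy step doubles as the generic algorithmic tool mentioned above: the vector s it returns is a subgradient of f, and for the norm Ω a subgradient at w is obtained by composing s (computed at |w|) with the signs of w.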
