On Sparsity Inducing Regularization Methods for Machine Learning

During the past few years there has been an explosion of interest in learning methods based on sparsity regularization. In this chapter, we discuss a general class of such methods, in which the regularizer can be expressed as the composition of a convex function ω with a linear function. This setting includes several methods such as the Group Lasso, the Fused Lasso, multi-task learning and many more. We present a general approach for solving regularization problems of this kind, under the assumption that the proximity operator of the function ω is available. Furthermore, we comment on the application of this approach to support vector machines, a technique pioneered by the groundbreaking work of Vladimir Vapnik.
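To make the setting concrete, the following is a minimal sketch, not the chapter's own algorithm, of two proximity operators that have closed forms and of a regularizer written as ω composed with a linear map. All function names here are hypothetical and chosen for illustration. The sketch assumes ω is the ℓ1 norm or a sum of ℓ2 norms over non-overlapping groups, and that the linear map for the Fused Lasso is the first-difference operator.

```python
import math


def prox_l1(v, t):
    """Proximity operator of t * ||.||_1: componentwise soft-thresholding."""
    out = []
    for x in v:
        if x > t:
            out.append(x - t)
        elif x < -t:
            out.append(x + t)
        else:
            out.append(0.0)
    return out


def prox_group_l2(v, groups, t):
    """Proximity operator of t * sum_g ||v_g||_2 (Group Lasso penalty).

    Assumes `groups` is a list of non-overlapping index lists.
    Each group is shrunk toward zero as a block.
    """
    out = list(v)
    for g in groups:
        norm = math.sqrt(sum(v[i] ** 2 for i in g))
        scale = max(1.0 - t / norm, 0.0) if norm > 0 else 0.0
        for i in g:
            out[i] = scale * v[i]
    return out


def first_differences(x):
    """Linear map B with (Bx)_i = x_{i+1} - x_i."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def fused_penalty(x):
    """Regularizer omega(Bx) with omega the l1 norm and B first differences."""
    return sum(abs(d) for d in first_differences(x))
```

For example, `prox_l1([3.0, -0.5, -2.0], 1.0)` sets the small middle entry to zero and shrinks the other two by 1, which is the mechanism that produces sparse solutions. The composite penalty `fused_penalty` is small for piecewise-constant vectors; its own proximity operator has no simple closed form in general, which is why a method that only requires the proximity operator of ω is useful.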
