An Efficient Projection for l1,∞ Regularization

In recent years the l1,∞ norm has been proposed for joint regularization. In essence, this type of regularization extends the l1 framework for learning sparse models to a setting where the goal is to learn a set of jointly sparse models. In this paper we derive a simple and effective projected gradient method for optimizing l1,∞-regularized problems. The main challenge in developing such a method lies in computing efficient projections onto the l1,∞ ball. We present an algorithm that runs in O(n log n) time and O(n) memory, where n is the number of parameters. We test our algorithm on a multi-task image annotation problem. Our results show that l1,∞ leads to better performance than both l2 and l1 regularization, and that it is effective in discovering jointly sparse solutions.
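For intuition, the simpler and well-known sort-based O(n log n) Euclidean projection onto the plain l1 ball can be sketched as below. This is not the paper's l1,∞ projection, only the analogous single-task case that the paper generalizes; the function name and implementation are illustrative assumptions, not code from the paper.

```python
def project_l1_ball(v, z=1.0):
    """Euclidean projection of vector v onto the l1 ball of radius z.

    Sort-based O(n log n) method shown as intuition only; the paper's
    algorithm handles the harder l1,inf (joint-sparsity) case.
    """
    if sum(abs(x) for x in v) <= z:
        return list(v)                      # already inside the ball
    u = sorted((abs(x) for x in v), reverse=True)  # magnitudes, descending
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        if uj > (cumsum - z) / j:           # coordinate still active
            theta = (cumsum - z) / j        # soft-threshold level
    # shrink each coordinate toward zero by theta
    return [(1 if x >= 0 else -1) * max(abs(x) - theta, 0.0) for x in v]
```

For example, projecting [3, 1] onto the l1 ball of radius 2 soft-thresholds by theta = 1, giving [2, 0].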
