Sparse learning via Boolean relaxations

We introduce novel relaxations for cardinality-constrained learning problems, including least-squares regression as a special but important case. Our approach is based on reformulating a cardinality-constrained problem exactly as a Boolean program, to which standard convex relaxations such as the Lasserre and Sherali-Adams hierarchies can be applied. We analyze the first-order relaxation in detail, deriving necessary and sufficient conditions for exactness in a unified manner. In the special case of least-squares regression, we show that these conditions are satisfied with high probability for random ensembles satisfying suitable incoherence conditions, similar to results on $\ell_1$-relaxations. In contrast to known methods, our relaxations yield lower bounds on the objective, and it can be verified whether or not the relaxation is exact. If it is not, we show that randomization based on the relaxed solution offers a principled way to generate provably good feasible solutions. This property enables us to obtain high-quality estimates even if incoherence conditions are not met, as might be expected in real datasets. We numerically illustrate the performance of the relaxation-randomization strategy on both synthetic and real high-dimensional datasets, revealing substantial improvements relative to $\ell_1$-based methods and greedy selection heuristics.
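
To make the relaxation-randomization strategy concrete, the sketch below works through the ridge-regularized least-squares case. Minimizing $\frac{1}{2}\|y - Xw\|_2^2 + \frac{\rho}{2}\|w\|_2^2$ over vectors $w$ supported on a Boolean indicator $u \in \{0,1\}^d$ with $\mathbf{1}^\top u \le k$ gives the exact Boolean program $\min_u \frac{1}{2}\, y^\top \big(I_n + \rho^{-1} X\,\mathrm{diag}(u)\, X^\top\big)^{-1} y$, and relaxing $u$ to the box $[0,1]^d$ yields a convex program whose value lower-bounds the original objective. The code is a minimal sketch under our own assumptions, not the authors' implementation: it assumes the cvxpy modeling library (whose `matrix_frac` atom expresses $y^\top P^{-1} y$), and the Bernoulli rounding rule, the refit step, and the function names `boolean_relaxation` and `round_and_refit` are illustrative choices of ours.

```python
import numpy as np
import cvxpy as cp


def boolean_relaxation(X, y, k, rho):
    """First-order (interval) relaxation of the exact Boolean program
        min over u in {0,1}^d, 1^T u <= k of
        0.5 * y^T (I_n + (1/rho) * X diag(u) X^T)^{-1} y,
    obtained by relaxing u to [0,1]^d. The optimal value is a lower
    bound on the cardinality-constrained ridge-regression objective."""
    n, d = X.shape
    u = cp.Variable(d)
    P = np.eye(n) + (1.0 / rho) * X @ cp.diag(u) @ X.T   # affine in u
    objective = cp.Minimize(0.5 * cp.matrix_frac(y, P))  # y^T P^{-1} y, convex
    constraints = [u >= 0, u <= 1, cp.sum(u) <= k]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return u.value, problem.value  # fractional indicators, lower bound


def round_and_refit(X, y, u_hat, k, rho, n_draws=50, seed=0):
    """Randomized rounding: draw supports with P(i in S) = u_hat[i],
    refit ridge regression on each support, keep the best feasible fit."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_w, best_obj = np.zeros(d), np.inf
    for _ in range(n_draws):
        # clip guards against small solver violations of 0 <= u_hat <= 1
        S = np.flatnonzero(rng.random(d) < np.clip(u_hat, 0.0, 1.0))
        if S.size == 0:
            continue
        if S.size > k:  # enforce feasibility: keep the k largest indicators
            S = S[np.argsort(u_hat[S])[-k:]]
        XS = X[:, S]
        wS = np.linalg.solve(XS.T @ XS + rho * np.eye(S.size), XS.T @ y)
        obj = 0.5 * np.sum((y - XS @ wS) ** 2) + 0.5 * rho * np.sum(wS ** 2)
        if obj < best_obj:
            best_obj = obj
            best_w = np.zeros(d)
            best_w[S] = wS
    return best_w, best_obj


# Example usage on synthetic data:
rng = np.random.default_rng(1)
n, d, k, rho = 60, 120, 5, 1.0
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:k] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(n)
u_hat, lower_bound = boolean_relaxation(X, y, k, rho)
w_hat, upper_bound = round_and_refit(X, y, u_hat, k, rho)
```

If the relaxed solution `u_hat` comes back integral, the relaxation is exact and `lower_bound` is the optimal value of the cardinality-constrained problem; otherwise the gap `upper_bound - lower_bound` certifies how suboptimal the rounded feasible estimate `w_hat` can be, which is the verifiability property highlighted in the abstract.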
