Grouped Variable Selection with Discrete Optimization: Computational and Statistical Perspectives

We present a new algorithmic framework for grouped variable selection that is based on discrete mathematical optimization. While there exist several appealing approaches based on convex relaxations and nonconvex heuristics, we focus on optimal solutions for the $\ell_0$-regularized formulation, a problem that is relatively unexplored due to computational challenges. Our methodology covers both high-dimensional linear regression and nonparametric sparse additive modeling with smooth components. Our algorithmic framework consists of approximate and exact algorithms. The approximate algorithms are based on coordinate descent and local search, with runtimes comparable to those of popular sparse learning algorithms. Our exact algorithm is based on a standalone branch-and-bound (BnB) framework that can solve the associated mixed integer programming (MIP) problem to certified optimality. By exploiting the problem structure, our custom BnB algorithm can solve to optimality problem instances with $5 \times 10^6$ features in minutes to hours, over 1000 times larger than what is currently possible using state-of-the-art commercial MIP solvers. We also study the statistical properties of the $\ell_0$-based estimators. We demonstrate, theoretically and empirically, that our proposed estimators have an edge over popular group-sparse estimators in terms of statistical performance in various regimes.
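To make the coordinate descent component concrete, below is a minimal NumPy sketch of block coordinate descent for a group-$\ell_0$ regularized least-squares objective with an optional ridge term on each group. The function name, the exact objective scaling (the 1/2 factor and the placement of the ridge penalty), and the dense per-group linear solves are our illustrative assumptions for small problems, not the authors' optimized implementation.

```python
import numpy as np

def group_l0_cd(X, y, groups, lam0, lam2=0.0, max_iter=100, tol=1e-8):
    """Block coordinate descent (illustrative sketch) for
        min_b  0.5 * ||y - X b||^2
               + lam0 * (number of nonzero groups)
               + lam2 * sum_g ||b_g||^2
    `groups` is a list of column-index arrays partitioning the columns of X.
    Assumes each X[:, g] has full column rank or lam2 > 0, so the
    per-group ridge system below is nonsingular.
    """
    n, p = X.shape
    b = np.zeros(p)
    r = y - X @ b  # residual y - X b, maintained incrementally

    for _ in range(max_iter):
        max_change = 0.0
        for g in groups:
            Xg = X[:, g]
            # Partial residual: add this group's current contribution back.
            r_g = r + Xg @ b[g]
            # Candidate nonzero update: ridge-regularized least squares
            # restricted to this group.
            A = Xg.T @ Xg + 2.0 * lam2 * np.eye(len(g))
            b_hat = np.linalg.solve(A, Xg.T @ r_g)
            # Compare the group-restricted objective at b_hat (which pays
            # the lam0 cost for a nonzero group) against setting b_g = 0.
            f_zero = 0.5 * (r_g @ r_g)
            res_hat = r_g - Xg @ b_hat
            f_hat = 0.5 * (res_hat @ res_hat) + lam2 * (b_hat @ b_hat) + lam0
            new_bg = b_hat if f_hat < f_zero else np.zeros(len(g))
            max_change = max(max_change, np.max(np.abs(new_bg - b[g])))
            b[g] = new_bg
            r = r_g - Xg @ b[g]
        if max_change < tol:
            break
    return b
```

On orthonormal group designs this per-group step reduces to group-wise hard thresholding. Local search moves of the kind described in the abstract (swapping groups in and out of the current support) could be layered on top of such a routine to escape poor local solutions; the sketch above implements only the plain coordinate descent pass.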
