Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives

We consider a discrete optimization based approach for learning sparse classifiers, where the outcome depends upon a linear combination of a small subset of features. Recent work has shown that mixed integer programming (MIP) can be used to solve (to optimality) $\ell_0$-regularized problems at scales much larger than was conventionally considered possible in the statistics and machine learning communities. Despite their usefulness, MIP-based approaches are significantly slower than the relatively mature algorithms based on $\ell_1$-regularization and its relatives. We aim to bridge this computational gap by developing new MIP-based algorithms for $\ell_0$-regularized classification. We propose two classes of scalable algorithms: an exact algorithm that can handle $p\approx 50,000$ features in a few minutes, and approximate algorithms that can address instances with $p\approx 10^6$ in times comparable to fast $\ell_1$-based algorithms. Our exact algorithm is based on the novel idea of \textsl{integrality generation}, which solves the original problem (with $p$ binary variables) via a sequence of mixed integer programs that involve a small number of binary variables. Our approximate algorithms are based on coordinate descent and local combinatorial search. In addition, we present new estimation error bounds for a class of $\ell_0$-regularized estimators. Experiments on real and synthetic data demonstrate that our approach leads to models with considerably improved statistical performance (especially in variable selection) compared to competing toolkits.
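
To make the approximate algorithms concrete, the sketch below illustrates the general recipe of coordinate descent combined with local combinatorial search, taking the $\ell_0$-penalized logistic loss $\sum_{i=1}^n \log(1+\exp(-y_i x_i^\top \beta)) + \lambda \|\beta\|_0$ as a representative classification objective. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (fit_l0_logistic, local_swap), the penalty value, and the toy data are hypothetical, and the coordinate update uses a standard quadratic upper bound on the logistic loss followed by hard thresholding.

\begin{verbatim}
# Illustrative sketch (assumptions: function names, lambda value, and toy
# data are hypothetical; this is not the authors' implementation).
import numpy as np


def _grad_j(X, y, beta, j):
    """Partial derivative of sum_i log(1 + exp(-y_i x_i'beta)) wrt coordinate j."""
    margins = y * (X @ beta)
    probs = 1.0 / (1.0 + np.exp(margins))      # sigma(-y_i x_i'beta)
    return -(y * probs) @ X[:, j]


def fit_l0_logistic(X, y, lam, n_sweeps=50):
    """Cyclic coordinate descent: each coordinate minimizes a quadratic upper
    bound on the loss plus the l0 penalty, giving a hard-threshold update."""
    n, p = X.shape
    L = 0.25 * np.sum(X ** 2, axis=0) + 1e-12  # per-coordinate curvature bounds
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            cand = beta[j] - _grad_j(X, y, beta, j) / L[j]
            # keep coordinate j only if its gain exceeds the l0 penalty
            beta[j] = cand if 0.5 * L[j] * cand ** 2 > lam else 0.0
    return beta


def local_swap(X, y, beta, lam, max_rounds=20):
    """Local combinatorial search: swap one selected feature for one excluded
    feature whenever the swap improves the l0-penalized objective."""
    def obj(b):
        return np.sum(np.logaddexp(0.0, -y * (X @ b))) + lam * np.count_nonzero(b)

    best = obj(beta)
    for _ in range(max_rounds):
        improved = False
        for j in np.flatnonzero(beta):
            for k in np.flatnonzero(beta == 0):
                trial = beta.copy()
                trial[k], trial[j] = trial[j], 0.0   # move coefficient j -> k
                val = obj(trial)
                if val < best:
                    beta, best, improved = trial, val, True
                    break
            if improved:
                break
        if not improved:                             # no improving swap found
            break
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    beta_true = np.zeros(50)
    beta_true[:5] = 2.0
    y = np.sign(X @ beta_true + 0.1 * rng.standard_normal(200))
    beta_hat = local_swap(X, y, fit_l0_logistic(X, y, lam=2.0), lam=2.0)
    print("selected features:", np.flatnonzero(beta_hat))
\end{verbatim}

The hard-threshold rule keeps a coordinate only when the decrease in the quadratic bound exceeds the penalty $\lambda$; the swap search then tries to exchange one selected feature for one excluded feature, accepting a swap only if the full objective improves, which is the local combinatorial refinement the abstract refers to.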
