Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives

We consider a discrete optimization based approach for learning sparse classifiers, where the outcome depends upon a linear combination of a small subset of features. Recent work has shown that mixed integer programming (MIP) can be used to solve (to optimality) $\ell_0$-regularized problems at scales much larger than was conventionally considered possible in the statistics and machine learning communities. Despite their usefulness, MIP-based approaches are significantly slower than the relatively mature algorithms based on $\ell_1$-regularization and its relatives. We aim to bridge this computational gap by developing new MIP-based algorithms for $\ell_0$-regularized classification. We propose two classes of scalable algorithms: an exact algorithm that can handle $p\approx 50,000$ features in a few minutes, and approximate algorithms that can address instances with $p\approx 10^6$ in times comparable to fast $\ell_1$-based algorithms. Our exact algorithm is based on the novel idea of \textsl{integrality generation}, which solves the original problem (with $p$ binary variables) via a sequence of mixed integer programs that involve a small number of binary variables. Our approximate algorithms are based on coordinate descent and local combinatorial search. In addition, we present new estimation error bounds for a class of $\ell_0$-regularized estimators. Experiments on real and synthetic data demonstrate that our approach leads to models with considerably improved statistical performance (especially in variable selection) compared to competing toolkits.
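
To make the approximate algorithms concrete, the sketch below illustrates the general recipe of coordinate descent combined with local combinatorial search, taking the $\ell_0$-penalized logistic loss $\sum_{i=1}^n \log(1+\exp(-y_i x_i^\top \beta)) + \lambda \|\beta\|_0$ as a representative classification objective. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (fit_l0_logistic, local_swap), the penalty value, and the toy data are hypothetical, and the coordinate update uses a standard quadratic upper bound on the logistic loss followed by hard thresholding.

\begin{verbatim}
# Illustrative sketch (assumptions: function names, lambda value, and toy
# data are hypothetical; this is not the authors' implementation).
import numpy as np


def _grad_j(X, y, beta, j):
    """Partial derivative of sum_i log(1 + exp(-y_i x_i'beta)) wrt coordinate j."""
    margins = y * (X @ beta)
    probs = 1.0 / (1.0 + np.exp(margins))      # sigma(-y_i x_i'beta)
    return -(y * probs) @ X[:, j]


def fit_l0_logistic(X, y, lam, n_sweeps=50):
    """Cyclic coordinate descent: each coordinate minimizes a quadratic upper
    bound on the loss plus the l0 penalty, giving a hard-threshold update."""
    n, p = X.shape
    L = 0.25 * np.sum(X ** 2, axis=0) + 1e-12  # per-coordinate curvature bounds
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            cand = beta[j] - _grad_j(X, y, beta, j) / L[j]
            # keep coordinate j only if its gain exceeds the l0 penalty
            beta[j] = cand if 0.5 * L[j] * cand ** 2 > lam else 0.0
    return beta


def local_swap(X, y, beta, lam, max_rounds=20):
    """Local combinatorial search: swap one selected feature for one excluded
    feature whenever the swap improves the l0-penalized objective."""
    def obj(b):
        return np.sum(np.logaddexp(0.0, -y * (X @ b))) + lam * np.count_nonzero(b)

    best = obj(beta)
    for _ in range(max_rounds):
        improved = False
        for j in np.flatnonzero(beta):
            for k in np.flatnonzero(beta == 0):
                trial = beta.copy()
                trial[k], trial[j] = trial[j], 0.0   # move coefficient j -> k
                val = obj(trial)
                if val < best:
                    beta, best, improved = trial, val, True
                    break
            if improved:
                break
        if not improved:                             # no improving swap found
            break
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))
    beta_true = np.zeros(50)
    beta_true[:5] = 2.0
    y = np.sign(X @ beta_true + 0.1 * rng.standard_normal(200))
    beta_hat = local_swap(X, y, fit_l0_logistic(X, y, lam=2.0), lam=2.0)
    print("selected features:", np.flatnonzero(beta_hat))
\end{verbatim}

The hard-threshold rule keeps a coordinate only when the decrease in the quadratic bound exceeds the penalty $\lambda$; the swap search then tries to exchange one selected feature for one excluded feature, accepting a swap only if the full objective improves, which is the local combinatorial refinement the abstract refers to.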
