A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers

This paper establishes a precise high-dimensional asymptotic theory for boosting on separable data, taking both statistical and computational perspectives. We consider the setting where the number of features (weak learners) $p$ scales with the sample size $n$ in an over-parametrized regime. Under a broad class of statistical models, we provide an exact analysis of the generalization error of boosting when the algorithm interpolates the training data and maximizes the empirical $\ell_1$-margin. The relation between the boosting test error and the optimal Bayes error is pinned down explicitly. In turn, these precise characterizations resolve several open questions surrounding boosting raised in \cite{breiman1999prediction, schapire1998boosting}. On the computational front, we provide a sharp analysis of the stopping time at which boosting approximately maximizes the empirical $\ell_1$-margin. Furthermore, we discover that the larger the over-parametrization ratio $p/n$, the smaller the proportion of active features (with zero initialization) and the faster the optimization reaches interpolation. At the heart of our theory lies an in-depth study of the maximum $\ell_1$-margin, which can be accurately described by a new system of non-linear equations; we analyze this margin and the properties of this system using Gaussian comparison techniques and a novel uniform deviation argument. Variants of AdaBoost corresponding to general $\ell_q$ geometry, for $q > 1$, are also presented, together with an exact analysis of the high-dimensional generalization and optimization behavior of a class of these algorithms.
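For concreteness, the two central objects can be written down in a standard form; the display below uses our own generic notation (data $(x_i, y_i)_{i=1}^n$ with $x_i \in \mathbb{R}^p$, $y_i \in \{\pm 1\}$, and a coefficient vector $\theta \in \mathbb{R}^p$ combining the weak learners), rather than an excerpt from the paper. On linearly separable data, the maximum empirical $\ell_1$-margin and the min-$\ell_1$-norm interpolating classifier are
\[
\kappa_{n,p} \;=\; \max_{\|\theta\|_1 \le 1} \; \min_{1 \le i \le n} \, y_i \langle x_i, \theta \rangle,
\qquad
\hat{\theta} \;=\; \arg\min \bigl\{ \|\theta\|_1 \,:\, y_i \langle x_i, \theta \rangle \ge 1 \ \text{for all } i \le n \bigr\},
\]
and a scaling argument gives $\kappa_{n,p} = 1/\|\hat{\theta}\|_1$, so the margin-maximizing direction and the minimum-$\ell_1$-norm interpolant coincide up to normalization.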

[1] Thomas M. Cover, et al. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition, 1965, IEEE Trans. Electron. Comput.

[2] A. Albert, et al. On the existence of maximum likelihood estimates in logistic regression models, 1984.

[3] Y. Gordon. Some inequalities for Gaussian processes and applications, 1985.

[4] Thomas J. Santner, et al. A note on A. Albert and J. A. Anderson's conditions for the existence of maximum likelihood estimates in logistic regression models, 1986.

[5] E. Gardner. The space of interactions in neural network models, 1988.

[6] Y. Gordon. On Milman's inequality and random subspaces which escape through a mesh in ℝ^n, 1988.

[7] Emmanuel Lesaffre, et al. Partial Separation in Logistic Discrimination, 1989.

[8] Yoav Freund. Boosting a weak learning algorithm by majority, 1990, COLT '90.

[9] Corinna Cortes, et al. Boosting Decision Trees, 1995, NIPS.

[10] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[11] Yoav Freund, et al. Experiments with a New Boosting Algorithm, 1996, ICML.

[12] J. Ross Quinlan, et al. Bagging, Boosting, and C4.5, 1996, AAAI/IAAI, Vol. 1.

[13] Leo Breiman, et al. Bias, Variance, and Arcing Classifiers, 1996.

[14] Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods, 1997, ICML.

[15] Dale Schuurmans, et al. Boosting in the Limit: Maximizing the Margin of Learned Ensembles, 1998, AAAI/IAAI.

[16] L. Breiman. Arcing Classifiers, 1998.

[17] Peter L. Bartlett, et al. Boosting Algorithms as Gradient Descent, 1999, NIPS.

[18] Leo Breiman, et al. Prediction Games and Arcing Algorithms, 1999, Neural Computation.

[19] J. Friedman. Special Invited Paper-Additive logistic regression: A statistical view of boosting, 2000.

[20] J. Friedman. Greedy function approximation: A gradient boosting machine, 2001.

[21] B. Yu, et al. Boosting with the L2-loss regression and classification, 2001.

[22] Shie Mannor, et al. Geometric Bounds for Generalization in Boosting, 2001, COLT/EuroCOLT.

[23] Wenxin Jiang, et al. Some Theoretical Aspects of Boosting in the Presence of Noisy Data, 2001, ICML.

[24] M. Shcherbina, et al. Rigorous Solution of the Gardner Problem, 2001, math-ph/0112003.

[25] V. Koltchinskii, et al. Empirical margin distributions and bounding the generalization error of combined classifiers, 2002, math/0405343.

[26] Shie Mannor, et al. The Consistency of Greedy Algorithms for Classification, 2002, COLT.

[27] P. Bühlmann, et al. Boosting With the L2 Loss, 2003.

[28] Wenxin Jiang. Process consistency for AdaBoost, 2003.

[29] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization, 2003.

[30] G. Lugosi, et al. On the Bayes-risk consistency of regularized boosting methods, 2003.

[31] Gilles Blanchard, et al. On the Rate of Convergence of Regularized Boosting Classifiers, 2003, J. Mach. Learn. Res.

[32] L. Breiman. Population theory for boosting ensembles, 2003.

[33] Shie Mannor, et al. On the Existence of Linear Weak Learners and Applications to Boosting, 2002, Machine Learning.

[34] Gunnar Rätsch, et al. Soft Margins for AdaBoost, 2001, Machine Learning.

[35] Yoram Singer, et al. Logistic Regression, AdaBoost and Bregman Distances, 2000, Machine Learning.

[36] Ji Zhu, et al. Boosting as a Regularized Path to a Maximum Margin Classifier, 2004, J. Mach. Learn. Res.

[37] Gunnar Rätsch, et al. Efficient Margin Maximizing with Boosting, 2005, J. Mach. Learn. Res.

[38] R. Schapire. The Strength of Weak Learnability, 1990, Machine Learning.

[39] Bin Yu, et al. Boosting with early stopping: Convergence and consistency, 2005, math/0508276.

[40] V. Koltchinskii, et al. Complexities of convex combinations and bounding the generalization error in classification, 2004, math/0405356.

[41] Vladimir Koltchinskii, et al. Exponential Convergence Rates in Classification, 2005, COLT.

[42] Robert E. Schapire, et al. How boosting the margin can also boost classifier complexity, 2006, ICML.

[43] Peter Bühlmann. Boosting for high-dimensional linear models, 2006, math/0606789.

[44] P. Bühlmann, et al. Sparse Boosting, 2006, J. Mach. Learn. Res.

[45] Peter L. Bartlett, et al. AdaBoost is Consistent, 2006, J. Mach. Learn. Res.

[46] P. Bickel, et al. Some Theory for Generalized Boosting Algorithms, 2006, J. Mach. Learn. Res.

[47] Peter Bühlmann, et al. Boosting Algorithms: Regularization, Prediction and Model Fitting, 2007, 0804.2752.

[48] R. Schapire, et al. Analysis of boosting algorithms using the smooth margin function, 2007, 0803.4092.

[49] Ali Rahimi, et al. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, 2008, NIPS.

[50] C. Villani. Optimal Transport: Old and New, 2008.

[51] Andrea Montanari, et al. Message-passing algorithms for compressed sensing, 2009, Proceedings of the National Academy of Sciences.

[52] Torsten Hothorn, et al. Twin Boosting: improved feature selection and prediction, 2010, Stat. Comput.

[53] Yoram Singer, et al. On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms, 2010, Machine Learning.

[54] Cynthia Rudin, et al. The Rate of Convergence of AdaBoost, 2011, COLT.

[55] Matus Telgarsky, et al. Margins, Shrinkage, and Boosting, 2013, ICML.

[56] Mihailo Stojnic, et al. A framework to characterize performance of LASSO algorithms, 2013, ArXiv.

[57] L. Ambrosio, et al. A User’s Guide to Optimal Transport, 2013.

[58] P. Bickel, et al. On robust regression with high-dimensional predictors, 2013, Proceedings of the National Academy of Sciences.

[59] Mihailo Stojnic, et al. Meshes that trap random subspaces, 2013, ArXiv.

[60] Paul Grigas, et al. AdaBoost and Forward Stagewise Regression are First-Order Convex Optimization Methods, 2013, ArXiv.

[61] Andrea Montanari, et al. High dimensional robust M-estimation: asymptotic variance via approximate message passing, 2013, Probability Theory and Related Fields.

[62] Christos Thrampoulidis, et al. A Tight Version of the Gaussian min-max theorem in the Presence of Convexity, 2014, ArXiv.

[63] Stephen P. Boyd, et al. Proximal Algorithms, 2013, Found. Trends Optim.

[64] Paul Grigas, et al. A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives, 2015, ArXiv.

[65] Christos Thrampoulidis, et al. Regularized Linear Regression: A Precise Analysis of the Estimation Error, 2015, COLT.

[66] Alexander Hanbo Li, et al. Boosting in the Presence of Outliers: Adaptive Classification With Nonconvex Loss Functions, 2015, ArXiv.

[67] Christos Thrampoulidis, et al. LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements, 2015, NIPS.

[68] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[69] Babak Hassibi, et al. A Universal Analysis of Large-Scale Regularized Least Squares Solutions, 2017, NIPS.

[70] Francis R. Bach, et al. Breaking the Curse of Dimensionality with Convex Neural Networks, 2014, J. Mach. Learn. Res.

[71] Mikhail Belkin, et al. To understand deep learning we need to understand kernel learning, 2018, ICML.

[72] Nathan Srebro, et al. Characterizing Implicit Bias in Terms of Optimization Geometry, 2018, ICML.

[73] Tengyuan Liang, et al. Just Interpolate: Kernel "Ridgeless" Regression Can Generalize, 2018, The Annals of Statistics.

[74] Alexandra Chouldechova, et al. The Frontiers of Fairness in Machine Learning, 2018, ArXiv.

[75] Christos Thrampoulidis, et al. Precise Error Analysis of Regularized $M$-Estimators in High Dimensions, 2016, IEEE Transactions on Information Theory.

[76] E. Candès, et al. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression, 2018, The Annals of Statistics.

[77] Noureddine El Karoui. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators, 2018.

[78] Mikhail Belkin, et al. Reconciling modern machine learning and the bias-variance trade-off, 2018, ArXiv.

[79] Zachary Chase Lipton. The mythos of model interpretability, 2016, ACM Queue.

[80] Mikhail Belkin, et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off, 2018, Proceedings of the National Academy of Sciences.

[81] Christos Thrampoulidis, et al. Phase Retrieval via Polytope Optimization: Geometry, Phase Transitions, and New Algorithms, 2018, ArXiv.

[82] Tengyuan Liang, et al. On the Risk of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels, 2019, ArXiv.

[83] Yuxin Chen, et al. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled Chi-square, 2017, Probability Theory and Related Fields.

[84] A. Montanari, et al. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime, 2019.

[85] Hong Hu, et al. Asymptotics and Optimal Designs of SLOPE for Sparse Linear Regression, 2019, IEEE International Symposium on Information Theory (ISIT).

[86] Christos Thrampoulidis, et al. A Model of Double Descent for High-dimensional Binary Linear Classification, 2019, Information and Inference: A Journal of the IMA.

[87] Adrian Weller, et al. Transparency: Motivations and Challenges, 2019, Explainable AI.

[88] E. Candès, et al. A modern maximum-likelihood theory for high-dimensional logistic regression, 2018, Proceedings of the National Academy of Sciences.

[89] A. Montanari, et al. Mean Field Asymptotics in High-Dimensional Statistics: From Exact Results to Efficient Algorithms, 2019, Proceedings of the International Congress of Mathematicians (ICM 2018).

[90] Tengyuan Liang, et al. Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits, 2019, Journal of the American Statistical Association.

[91] Jon M. Kleinberg, et al. Simplicity Creates Inequity: Implications for Fairness, Stereotypes, and Interpretability, 2018, EC.

[92] Cynthia Rudin, et al. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, 2018, Nature Machine Intelligence.

[93] Andrea Montanari, et al. Surprises in High-Dimensional Ridgeless Least Squares Interpolation, 2019, Annals of Statistics.

[94] Mikhail Belkin, et al. Does data interpolation contradict statistical optimality?, 2018, AISTATS.

[95] Yue M. Lu, et al. Universality Laws for High-Dimensional Learning with Random Features, 2020, ArXiv.

[96] Tengyuan Liang, et al. Mehler’s Formula, Branching Process, and Compositional Kernels of Deep Neural Networks, 2020, Journal of the American Statistical Association.

[97] Tengyuan Liang, et al. On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels, 2019, COLT.

[98] Andrea Montanari, et al. The Lasso with general Gaussian designs with applications to hypothesis testing, 2020, ArXiv.

[99] Francis Bach, et al. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss, 2020, COLT.

[100] Emmanuel J. Candès, et al. The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance, 2020, Bernoulli.

[101] Manfred K. Warmuth, et al. Winnowing with Gradient Descent, 2020, COLT.

[102] Philip M. Long, et al. Benign overfitting in linear regression, 2019, Proceedings of the National Academy of Sciences.

[103] A. Maleki, et al. Which bridge estimator is the best for variable selection?, 2020.

[104] Mohamed-Slim Alouini, et al. Precise Error Analysis of the LASSO under Correlated Designs, 2020, ArXiv.

[105] Mikhail Belkin, et al. Two models of double descent for weak features, 2019, SIAM J. Math. Data Sci.

[106] Florentina Bunea, et al. Interpolation under latent factor regression models, 2020, ArXiv.

[107] Matus Telgarsky, et al. Characterizing the implicit bias via a primal-dual analysis, 2019, ALT.

[108] Andrea Montanari, et al. The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve, 2019, Communications on Pure and Applied Mathematics.

[109] Hanwen Huang. LASSO risk and phase transition under dependence, 2021.

[110] Philip M. Long, et al. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime, 2020, ArXiv.