A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers

This paper establishes a precise high-dimensional asymptotic theory for boosting on separable data, taking both statistical and computational perspectives. We consider the setting where the number of features (weak learners) $p$ scales with the sample size $n$ in an over-parametrized regime. Under a broad class of statistical models, we provide an exact analysis of the generalization error of boosting when the algorithm interpolates the training data and maximizes the empirical $\ell_1$-margin. The relation between the boosting test error and the optimal Bayes error is pinned down explicitly. In turn, these precise characterizations resolve several open questions surrounding boosting raised in \cite{breiman1999prediction, schapire1998boosting}. On the computational front, we provide a sharp analysis of the stopping time at which boosting approximately maximizes the empirical $\ell_1$-margin. Furthermore, we discover that the larger the over-parametrization ratio $p/n$, the smaller the proportion of active features (with zero initialization) and the faster the optimization reaches interpolation. At the heart of our theory lies an in-depth study of the maximum $\ell_1$-margin, which can be accurately described by a new system of non-linear equations; we analyze this margin and the properties of this system using Gaussian comparison techniques and a novel uniform deviation argument. Variants of AdaBoost corresponding to general $\ell_q$ geometry, for $q > 1$, are also presented, together with an exact analysis of the high-dimensional generalization and optimization behavior of a class of these algorithms.
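For concreteness, the two central objects can be written down in a standard form; the display below uses our own generic notation (data $(x_i, y_i)_{i=1}^n$ with $x_i \in \mathbb{R}^p$, $y_i \in \{\pm 1\}$, and a coefficient vector $\theta \in \mathbb{R}^p$ combining the weak learners), rather than an excerpt from the paper. On linearly separable data, the maximum empirical $\ell_1$-margin and the min-$\ell_1$-norm interpolating classifier are
\[
\kappa_{n,p} \;=\; \max_{\|\theta\|_1 \le 1} \; \min_{1 \le i \le n} \, y_i \langle x_i, \theta \rangle,
\qquad
\hat{\theta} \;=\; \arg\min \bigl\{ \|\theta\|_1 \,:\, y_i \langle x_i, \theta \rangle \ge 1 \ \text{for all } i \le n \bigr\},
\]
and a scaling argument gives $\kappa_{n,p} = 1/\|\hat{\theta}\|_1$, so the margin-maximizing direction and the minimum-$\ell_1$-norm interpolant coincide up to normalization.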

[1] Thomas M. Cover, et al. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition, 1965, IEEE Trans. Electron. Comput.

[2] A. Albert, et al. On the existence of maximum likelihood estimates in logistic regression models, 1984.

[3] Y. Gordon. Some inequalities for Gaussian processes and applications, 1985.

[4] Thomas J. Santner, et al. A note on A. Albert and J. A. Anderson's conditions for the existence of maximum likelihood estimates in logistic regression models, 1986.

[5] E. Gardner. The space of interactions in neural network models, 1988.

[6] Y. Gordon. On Milman's inequality and random subspaces which escape through a mesh in ℝ^n, 1988.

[7] Emmanuel Lesaffre, et al. Partial Separation in Logistic Discrimination, 1989.

[8] Yoav Freund. Boosting a weak learning algorithm by majority, 1990, COLT '90.

[9] Corinna Cortes, et al. Boosting Decision Trees, 1995, NIPS.

[10] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[11] Yoav Freund, et al. Experiments with a New Boosting Algorithm, 1996, ICML.

[12] J. Ross Quinlan, et al. Bagging, Boosting, and C4.5, 1996, AAAI/IAAI, Vol. 1.

[13] Leo Breiman, et al. Bias, Variance, and Arcing Classifiers, 1996.

[14] Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods, 1997, ICML.

[15] Dale Schuurmans, et al. Boosting in the Limit: Maximizing the Margin of Learned Ensembles, 1998, AAAI/IAAI.

[16] L. Breiman. Arcing Classifiers, 1998.

[17] Peter L. Bartlett, et al. Boosting Algorithms as Gradient Descent, 1999, NIPS.

[18] Leo Breiman, et al. Prediction Games and Arcing Algorithms, 1999, Neural Computation.

[19] J. Friedman. Special Invited Paper-Additive logistic regression: A statistical view of boosting, 2000.

[20] J. Friedman. Greedy function approximation: A gradient boosting machine, 2001.

[21] B. Yu, et al. Boosting with the L2-loss regression and classification, 2001.

[22] Shie Mannor, et al. Geometric Bounds for Generalization in Boosting, 2001, COLT/EuroCOLT.

[23] Wenxin Jiang, et al. Some Theoretical Aspects of Boosting in the Presence of Noisy Data, 2001, ICML.

[24] M. Shcherbina, et al. Rigorous Solution of the Gardner Problem, 2001, math-ph/0112003.

[25] V. Koltchinskii, et al. Empirical margin distributions and bounding the generalization error of combined classifiers, 2002, math/0405343.

[26] Shie Mannor, et al. The Consistency of Greedy Algorithms for Classification, 2002, COLT.

[27] P. Bühlmann, et al. Boosting With the L2 Loss, 2003.

[28] Wenxin Jiang. Process consistency for AdaBoost, 2003.

[29] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization, 2003.

[30] G. Lugosi, et al. On the Bayes-risk consistency of regularized boosting methods, 2003.

[31] Gilles Blanchard, et al. On the Rate of Convergence of Regularized Boosting Classifiers, 2003, J. Mach. Learn. Res.

[32] L. Breiman. Population theory for boosting ensembles, 2003.

[33] Shie Mannor, et al. On the Existence of Linear Weak Learners and Applications to Boosting, 2002, Machine Learning.

[34] Gunnar Rätsch, et al. Soft Margins for AdaBoost, 2001, Machine Learning.

[35] Yoram Singer, et al. Logistic Regression, AdaBoost and Bregman Distances, 2000, Machine Learning.

[36] Ji Zhu, et al. Boosting as a Regularized Path to a Maximum Margin Classifier, 2004, J. Mach. Learn. Res.

[37] Gunnar Rätsch, et al. Efficient Margin Maximizing with Boosting, 2005, J. Mach. Learn. Res.

[38] R. Schapire. The Strength of Weak Learnability, 1990, Machine Learning.

[39] Bin Yu, et al. Boosting with early stopping: Convergence and consistency, 2005, math/0508276.

[40] V. Koltchinskii, et al. Complexities of convex combinations and bounding the generalization error in classification, 2004, math/0405356.

[41] Vladimir Koltchinskii, et al. Exponential Convergence Rates in Classification, 2005, COLT.

[42] Robert E. Schapire, et al. How boosting the margin can also boost classifier complexity, 2006, ICML.

[43] Peter Bühlmann. Boosting for high-dimensional linear models, 2006, math/0606789.

[44] P. Bühlmann, et al. Sparse Boosting, 2006, J. Mach. Learn. Res.

[45] Peter L. Bartlett, et al. AdaBoost is Consistent, 2006, J. Mach. Learn. Res.

[46] P. Bickel, et al. Some Theory for Generalized Boosting Algorithms, 2006, J. Mach. Learn. Res.

[47] Peter Bühlmann, et al. Boosting Algorithms: Regularization, Prediction and Model Fitting, 2007, 0804.2752.

[48] R. Schapire, et al. Analysis of boosting algorithms using the smooth margin function, 2007, 0803.4092.

[49] Ali Rahimi, et al. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, 2008, NIPS.

[50] C. Villani. Optimal Transport: Old and New, 2008.

[51] Andrea Montanari, et al. Message-passing algorithms for compressed sensing, 2009, Proceedings of the National Academy of Sciences.

[52] Torsten Hothorn, et al. Twin Boosting: improved feature selection and prediction, 2010, Stat. Comput.

[53] Yoram Singer, et al. On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms, 2010, Machine Learning.

[54] Cynthia Rudin, et al. The Rate of Convergence of AdaBoost, 2011, COLT.

[55] Matus Telgarsky, et al. Margins, Shrinkage, and Boosting, 2013, ICML.

[56] Mihailo Stojnic, et al. A framework to characterize performance of LASSO algorithms, 2013, ArXiv.

[57] L. Ambrosio, et al. A User’s Guide to Optimal Transport, 2013.

[58] P. Bickel, et al. On robust regression with high-dimensional predictors, 2013, Proceedings of the National Academy of Sciences.

[59] Mihailo Stojnic, et al. Meshes that trap random subspaces, 2013, ArXiv.

[60] Paul Grigas, et al. AdaBoost and Forward Stagewise Regression are First-Order Convex Optimization Methods, 2013, ArXiv.

[61] Andrea Montanari, et al. High dimensional robust M-estimation: asymptotic variance via approximate message passing, 2013, Probability Theory and Related Fields.

[62] Christos Thrampoulidis, et al. A Tight Version of the Gaussian min-max theorem in the Presence of Convexity, 2014, ArXiv.

[63] Stephen P. Boyd, et al. Proximal Algorithms, 2013, Found. Trends Optim.

[64] Paul Grigas, et al. A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives, 2015, ArXiv.

[65] Christos Thrampoulidis, et al. Regularized Linear Regression: A Precise Analysis of the Estimation Error, 2015, COLT.

[66] Alexander Hanbo Li, et al. Boosting in the Presence of Outliers: Adaptive Classification With Nonconvex Loss Functions, 2015, ArXiv.

[67] Christos Thrampoulidis, et al. LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements, 2015, NIPS.

[68] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[69] Babak Hassibi, et al. A Universal Analysis of Large-Scale Regularized Least Squares Solutions, 2017, NIPS.

[70] Francis R. Bach, et al. Breaking the Curse of Dimensionality with Convex Neural Networks, 2014, J. Mach. Learn. Res.

[71] Mikhail Belkin, et al. To understand deep learning we need to understand kernel learning, 2018, ICML.

[72] Nathan Srebro, et al. Characterizing Implicit Bias in Terms of Optimization Geometry, 2018, ICML.

[73] Tengyuan Liang, et al. Just Interpolate: Kernel "Ridgeless" Regression Can Generalize, 2018, The Annals of Statistics.

[74] Alexandra Chouldechova, et al. The Frontiers of Fairness in Machine Learning, 2018, ArXiv.

[75] Christos Thrampoulidis, et al. Precise Error Analysis of Regularized $M$-Estimators in High Dimensions, 2016, IEEE Transactions on Information Theory.

[76] E. Candès, et al. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression, 2018, The Annals of Statistics.

[77] Noureddine El Karoui. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators, 2018.

[78] Mikhail Belkin, et al. Reconciling modern machine learning and the bias-variance trade-off, 2018, ArXiv.

[79] Zachary Chase Lipton. The mythos of model interpretability, 2016, ACM Queue.

[80] Mikhail Belkin, et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off, 2018, Proceedings of the National Academy of Sciences.

[81] Christos Thrampoulidis, et al. Phase Retrieval via Polytope Optimization: Geometry, Phase Transitions, and New Algorithms, 2018, ArXiv.

[82] Tengyuan Liang, et al. On the Risk of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels, 2019, ArXiv.

[83] Yuxin Chen, et al. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled Chi-square, 2017, Probability Theory and Related Fields.

[84] A. Montanari, et al. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime, 2019.

[85] Hong Hu, et al. Asymptotics and Optimal Designs of SLOPE for Sparse Linear Regression, 2019, IEEE International Symposium on Information Theory (ISIT).

[86] Christos Thrampoulidis, et al. A Model of Double Descent for High-dimensional Binary Linear Classification, 2019, Information and Inference: A Journal of the IMA.

[87] Adrian Weller, et al. Transparency: Motivations and Challenges, 2019, Explainable AI.

[88] E. Candès, et al. A modern maximum-likelihood theory for high-dimensional logistic regression, 2018, Proceedings of the National Academy of Sciences.

[89] A. Montanari, et al. Mean Field Asymptotics in High-Dimensional Statistics: From Exact Results to Efficient Algorithms, 2019, Proceedings of the International Congress of Mathematicians (ICM 2018).

[90] Tengyuan Liang, et al. Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits, 2019, Journal of the American Statistical Association.

[91] Jon M. Kleinberg, et al. Simplicity Creates Inequity: Implications for Fairness, Stereotypes, and Interpretability, 2018, EC.

[92] Cynthia Rudin, et al. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, 2018, Nature Machine Intelligence.

[93] Andrea Montanari, et al. Surprises in High-Dimensional Ridgeless Least Squares Interpolation, 2019, Annals of Statistics.

[94] Mikhail Belkin, et al. Does data interpolation contradict statistical optimality?, 2018, AISTATS.

[95] Yue M. Lu, et al. Universality Laws for High-Dimensional Learning with Random Features, 2020, ArXiv.

[96] Tengyuan Liang, et al. Mehler’s Formula, Branching Process, and Compositional Kernels of Deep Neural Networks, 2020, Journal of the American Statistical Association.

[97] Tengyuan Liang, et al. On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels, 2019, COLT.

[98] Andrea Montanari, et al. The Lasso with general Gaussian designs with applications to hypothesis testing, 2020, ArXiv.

[99] Francis Bach, et al. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss, 2020, COLT.

[100] Emmanuel J. Candès, et al. The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance, 2020, Bernoulli.

[101] Manfred K. Warmuth, et al. Winnowing with Gradient Descent, 2020, COLT.

[102] Philip M. Long, et al. Benign overfitting in linear regression, 2019, Proceedings of the National Academy of Sciences.

[103] A. Maleki, et al. Which bridge estimator is the best for variable selection?, 2020.

[104] Mohamed-Slim Alouini, et al. Precise Error Analysis of the LASSO under Correlated Designs, 2020, ArXiv.

[105] Mikhail Belkin, et al. Two models of double descent for weak features, 2019, SIAM J. Math. Data Sci.

[106] Florentina Bunea, et al. Interpolation under latent factor regression models, 2020, ArXiv.

[107] Matus Telgarsky, et al. Characterizing the implicit bias via a primal-dual analysis, 2019, ALT.

[108] Andrea Montanari, et al. The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve, 2019, Communications on Pure and Applied Mathematics.

[109] Hanwen Huang. LASSO risk and phase transition under dependence, 2021.

[110] Philip M. Long, et al. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime, 2020, ArXiv.