Characterizing the implicit bias via a primal-dual analysis

This paper shows that the implicit bias of gradient descent on linearly separable data is exactly characterized by the optimal solution of a dual optimization problem given by a smoothed margin, even for general losses. This is in contrast to prior results, which are often tailored to exponentially tailed losses. For the exponential loss specifically, with $n$ training examples and $t$ gradient descent steps, our dual analysis further allows us to prove an $O(\ln(n)/\ln(t))$ convergence rate to the $\ell_2$ maximum margin direction when a constant step size is used. This rate is tight in both $n$ and $t$, a tightness not established by prior work. On the other hand, with a properly chosen but aggressive step size schedule, we prove $O(1/t)$ rates for both $\ell_2$ margin maximization and implicit bias, whereas prior work (including all first-order methods for the general hard-margin linear SVM problem) proved $\widetilde{O}(1/\sqrt{t})$ margin rates, or $O(1/t)$ margin rates to a suboptimal margin, with a correspondingly slower implied bias rate. Our key observations are that gradient descent on the primal variable naturally induces a mirror descent update on the dual variable, and that the dual objective in this setting is smooth enough to admit a faster rate.
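
As a rough illustration of the primal-dual observation above, the following sketch (not taken from the paper; the synthetic dataset, constant step size, and iteration counts are all illustrative assumptions) runs gradient descent on the exponential loss over a small linearly separable dataset. The dual variable $q_i \propto \exp(-y_i \langle x_i, w \rangle)$ is tracked alongside the primal iterate: each primal gradient step rescales the unnormalized dual coordinates multiplicatively (an entropic, mirror-descent-style update), while the normalized margin of $w_t/\|w_t\|$ grows and the dual distribution concentrates on the hardest examples.

```python
# A minimal numerical sketch (not from the paper) of the primal-dual view:
# gradient descent on the exponential loss over the primal variable w keeps a
# dual variable q_i proportional to exp(-y_i <x_i, w>), and each primal step
# rescales these dual coordinates multiplicatively, while the normalized
# margin of w increases.  Data, step size, and iteration counts are
# illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable toy data: labels given by a ground-truth direction.
n, d = 20, 2
X = rng.normal(size=(n, d))
w_star = np.array([1.0, 0.5])
y = np.sign(X @ w_star)

A = y[:, None] * X               # rows a_i = y_i x_i, so the margins are A @ w


def exp_loss_grad(w):
    """Exponential loss (1/n) sum_i exp(-a_i^T w) and its gradient.

    The gradient is -(1/n) A^T q_unnorm, where q_unnorm_i = exp(-a_i^T w)
    is the unnormalized dual variable.
    """
    q_unnorm = np.exp(-A @ w)
    loss = q_unnorm.mean()
    grad = -(A.T @ q_unnorm) / n
    return loss, grad


w = np.zeros(d)
eta = 0.1                        # constant step size (illustrative choice)
for t in range(1, 2001):
    loss, grad = exp_loss_grad(w)
    w = w - eta * grad           # primal gradient descent step
    if t in (1, 10, 100, 1000, 2000):
        q_unnorm = np.exp(-A @ w)            # dual variable at the current iterate
        q = q_unnorm / q_unnorm.sum()        # normalized: a distribution over examples
        margin = np.min(A @ w) / np.linalg.norm(w)
        print(f"t={t:5d}  loss={loss:.4f}  normalized margin={margin:.4f}  "
              f"max dual weight={q.max():.3f}")
```

Under the constant step size, the printed normalized margin should increase only slowly, consistent with the logarithmic-in-$t$ rate stated above, while the dual distribution concentrates on the examples with the smallest margins; the aggressive step size schedules analyzed in the paper are not reproduced in this sketch.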
