Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models

With an eye toward understanding complexity control in deep learning, we study how infinitesimal regularization or gradient descent optimization leads to margin-maximizing solutions in both homogeneous and non-homogeneous models, extending previous work that considered infinitesimal regularization only in homogeneous models. To this end, we study the limit of loss minimization with a diverging norm constraint (the "constrained path"), relate it to the limit of a "margin path", and characterize the resulting solution. For non-homogeneous ensemble models, whose output is a sum of homogeneous sub-models, we show that this solution discards the shallowest sub-models if they are unnecessary. For homogeneous models, we show convergence to a "lexicographic max-margin solution", and provide conditions under which max-margin solutions are also attained as the limit of unconstrained gradient descent.
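For concreteness, the following is a minimal LaTeX sketch of the two objects compared in the abstract, written in the notation commonly used in this line of work; the specific symbols (parameters w, norm bound B, model output f(x; w), loss \ell, labels y_n) are assumptions for illustration and are not fixed by the abstract itself.

\documentclass{article}
\usepackage{amsmath}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}
\begin{document}
% Assumed notation (not taken verbatim from the paper): parameters $w$,
% norm bound $B$, model output $f(x;w)$, labels $y_n \in \{-1,+1\}$,
% and a monotone classification loss $\ell$ (e.g., exponential or logistic).
\begin{align*}
  w_{\mathrm{loss}}(B)   &\in \argmin_{\|w\| \le B} \; \sum_{n} \ell\bigl(y_n f(x_n; w)\bigr)
    &&\text{(constrained path: loss minimization under a norm constraint)}\\
  w_{\mathrm{margin}}(B) &\in \argmax_{\|w\| \le B} \; \min_{n} \, y_n f(x_n; w)
    &&\text{(margin path: max-margin under the same constraint)}
\end{align*}
% The abstract studies the limits of these two paths as $B \to \infty$
% and relates them to each other and to unconstrained gradient descent.
\end{document}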
