On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent

Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so-called “rich regimes”. However, the initialization structure is richer than the overall scale alone and involves the relative magnitudes of different weights and layers in the network. Here we show that these relative scales, which we refer to as initialization shape, play an important role in determining the learned model. We develop a novel technique for deriving the inductive bias of gradient flow and use it to obtain closed-form implicit regularizers for multiple cases of interest.
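
As a concrete illustration of the claim that relative layer scales matter beyond the overall scale, the sketch below runs gradient descent on a toy two-layer diagonal linear model from two initializations with the same elementwise product u0 * v0 (same scale) but different relative magnitudes of the two layers (different shape). The model, data, and all hyperparameters are assumptions made for illustration and are not taken from the paper.

```python
# Minimal sketch (not the paper's construction): gradient descent on a toy
# diagonal linear model f(x) = (u * v) @ x, showing that initialization *shape*
# (the relative magnitude of the layers u and v) can change the learned solution
# even when the overall scale is held fixed. Data, model, and hyperparameters
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 5                                  # under-determined: many interpolating solutions
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:2] = 1.0                              # sparse ground truth
y = X @ w_star

def train(u0, v0, lr=1e-3, steps=200_000):
    """Plain gradient descent on 0.5 * ||X (u*v) - y||^2 over (u, v)."""
    u, v = u0.copy(), v0.copy()
    for _ in range(steps):
        g = X.T @ (X @ (u * v) - y)           # gradient w.r.t. the product w = u*v
        u, v = u - lr * g * v, v - lr * g * u # chain rule through w = u*v (simultaneous update)
    return u * v

alpha = 0.1
# Same overall scale (u0 * v0 == alpha**2 in both runs), different shapes:
w_balanced   = train(alpha * np.ones(d),      alpha * np.ones(d))        # |u0| == |v0|
w_unbalanced = train(10 * alpha * np.ones(d), 0.1 * alpha * np.ones(d))  # |u0| >> |v0|

for name, w in [("balanced", w_balanced), ("unbalanced", w_unbalanced)]:
    print(f"{name:10s}  train residual {np.linalg.norm(X @ w - y):.1e}   "
          f"||w||_1 = {np.abs(w).sum():.3f}   ||w||_2 = {np.linalg.norm(w):.3f}")
```

Under the usual kernel/rich intuition one would expect the balanced small initialization to behave more like the rich regime (a sparser interpolator with smaller L1 norm) and the unbalanced initialization, with one large and one small layer, to stay closer to plain gradient descent on w (a denser solution closer to the minimum-L2 interpolator). The point of the sketch is only that two initializations with identical scale but different shape can end at different solutions.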
