Convergence Analysis of Over-parameterized Deep Linear Networks, and the Principal Components Bias

Convolutional neural networks of different architectures seem to learn to classify images in the same order. To understand this phenomenon, we revisit the over-parameterized deep linear network model. Our analysis of this model's learning dynamics reveals that the convergence of its parameters is exponentially faster along directions corresponding to the larger principal components of the data, at a rate governed by the corresponding singular values. We term this convergence pattern the Principal Components bias (PC-bias). We show how the PC-bias streamlines the order of learning of both linear and non-linear networks, more prominently in the earlier stages of learning. We then compare our results to the spectral bias, showing that the two biases can be observed independently and affect the order of learning in different ways. Finally, we discuss how the PC-bias can explain several phenomena, including the benefits of prevalent initialization schemes, how early stopping may be related to PCA, and why deep networks converge more slowly when given random labels.
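The core dynamic behind the PC-bias can be sketched in the simplest (one-layer) linear case: under gradient descent on the squared loss, the parameter error along each principal direction of the data decays as (1 - lr·λᵢ)ᵗ, so directions with larger eigenvalues λᵢ converge exponentially faster. The toy data, learning rate, and step count below are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data whose covariance has one large eigenvalue (~10)
# and one small eigenvalue (~1).
n = 2000
X = rng.standard_normal((n, 2)) * np.array([np.sqrt(10.0), 1.0])
w_true = np.array([1.0, -2.0])
y = X @ w_true

Sigma = X.T @ X / n                        # empirical covariance
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order

# Gradient descent on 0.5 * mean squared error of the linear model w.
w = np.zeros(2)
lr = 0.01
for t in range(200):
    w -= lr * Sigma @ (w - w_true)         # exact gradient for this loss

# Project the remaining parameter error onto the principal directions:
# the large-eigenvalue (fast) mode is essentially converged, while the
# small-eigenvalue (slow) mode still carries most of the error.
err = np.abs(eigvecs.T @ (w - w_true))
print(err)   # [slow-mode error, fast-mode error]
```

In the eigenbasis of the covariance the update decouples, so the printout directly shows the exponential gap between the two modes; deep linear networks exhibit the same per-component ordering with depth-dependent rates.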
