Exact solutions of a deep linear network

This work derives the analytical expression for the global minima of a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the loss landscape of neural networks. Our result implies that the origin is a special point in the deep neural network loss landscape where highly nonlinear phenomena emerge. We show that weight decay interacts strongly with the model architecture and can create bad minima at zero in a network with more than $1$ hidden layer, a behavior qualitatively different from that of a network with only $1$ hidden layer. Practically, our result implies that common deep learning initialization methods are, in general, insufficient to ease the optimization of neural networks.
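
To make the depth dependence concrete, here is a minimal sketch in our own notation; the exact parameterization and the treatment of the stochastic neurons in the paper may differ. Consider a depth-$d$ linear network trained with weight decay,
$$L(W_1,\dots,W_d) \;=\; \mathbb{E}_{(x,y)}\big\|W_d\cdots W_1 x - y\big\|^2 \;+\; \lambda\sum_{i=1}^{d}\|W_i\|_F^2 .$$
Along a ray $W_i = tU_i$ leaving the origin, the data term changes only at order $t^d$ through the cross term $-2t^d\,\mathbb{E}\langle y,\, U_d\cdots U_1 x\rangle$, while the penalty grows as $\lambda t^2\sum_i\|U_i\|_F^2$. With one hidden layer ($d=2$) the two effects compete at the same order, so the origin can be escaped whenever $\lambda$ is small enough; with more than one hidden layer ($d\ge 3$) the quadratic penalty dominates near zero and the origin is a local minimum for any $\lambda>0$, which is the qualitative difference highlighted above.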
