Luck Matters: Understanding Training Dynamics of Deep ReLU Networks

We analyze the training dynamics of deep ReLU networks and their implications for generalization. Using a teacher-student setting, we discover a novel relationship between the gradient received by hidden student nodes and the activations of teacher nodes in deep ReLU networks. With this relationship and the assumption that teacher node activations overlap little, we prove that (1) student nodes whose weights are initialized close to teacher nodes converge to them at a faster rate, and (2) in the over-parameterized, two-layer case, while a small set of lucky nodes converge to the teacher nodes, the fan-out weights of the remaining nodes converge to zero. This framework provides insight into several puzzling phenomena in deep learning, such as over-parameterization, implicit regularization, and lottery tickets. We support our assumption by showing that the majority of BatchNorm biases in pre-trained VGG11/16 models are negative. Experiments on (1) random deep teacher networks with Gaussian inputs, (2) teacher networks pre-trained on CIFAR-10, and (3) extensive ablation studies validate our theoretical predictions.
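
The sketch below is a minimal illustration of the teacher-student setting described above, not the authors' code: a fixed two-layer ReLU teacher labels Gaussian inputs, an over-parameterized student is trained on them, and the alignment between teacher and student weight vectors (the "lucky node" quantity suggested by claim (1)) is inspected afterwards. All network sizes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, m_teacher, m_student, d_out = 20, 10, 50, 1  # student is over-parameterized

def two_layer_relu(d_in, hidden, d_out):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

teacher = two_layer_relu(d_in, m_teacher, d_out)
for p in teacher.parameters():
    p.requires_grad_(False)            # teacher weights stay fixed

student = two_layer_relu(d_in, m_student, d_out)
opt = torch.optim.SGD(student.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(5000):
    x = torch.randn(128, d_in)         # Gaussian inputs, as in the experiments
    with torch.no_grad():
        y = teacher(x)                 # labels come from the fixed teacher
    loss = loss_fn(student(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:5d}  loss {loss.item():.6f}")

# For each teacher node, the cosine similarity to its best-matching student node;
# values near 1 indicate student nodes that have converged toward that teacher node.
w_t = teacher[0].weight                # shape (m_teacher, d_in)
w_s = student[0].weight                # shape (m_student, d_in)
cos = nn.functional.normalize(w_t, dim=1) @ nn.functional.normalize(w_s, dim=1).T
print("best cosine alignment per teacher node:", cos.max(dim=1).values)
```

Under these assumptions, one would expect a subset of student nodes to align closely with teacher nodes while the fan-out weights of the rest shrink toward zero, consistent with claim (2).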
