Over-parameterization as a Catalyst for Better Generalization of Deep ReLU Networks

To analyze deep ReLU networks, we adopt a student-teacher setting in which an over-parameterized student network learns, via Stochastic Gradient Descent (SGD), from the output of a fixed teacher network of the same depth. Our contributions are two-fold. First, we prove that when the gradient is zero (or bounded above by a small constant) at every training data point, a situation called the \emph{interpolation setting}, there exists a many-to-one \emph{alignment} between student and teacher nodes in the lowest layer under mild conditions. This suggests that generalization to unseen data is achievable, even though the same condition often also yields zero training error. Second, an analysis of noisy recovery and training dynamics in 2-layer networks shows that strong teacher nodes (those with large fan-out weights) are learned first, while weaker teacher nodes are left unlearned until a late stage of training. As a result, it can take a long time to converge to these small-gradient critical points. Our analysis shows that over-parameterization plays two roles: (1) it is a necessary condition for alignment to occur at the critical points, and (2) during training, it helps student nodes cover more teacher nodes in fewer iterations. Both improve generalization. Experiments corroborate our findings. The code is released at this https URL.
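To make the setting concrete, the sketch below (PyTorch) trains an over-parameterized 2-layer ReLU student on the outputs of a fixed teacher of the same depth and then measures, for each teacher hidden node, the best-matching student node by activation correlation, a rough proxy for the node alignment discussed above. The widths, learning rate, step count, and Gaussian input distribution are illustrative assumptions, not the paper's exact experimental configuration.

```python
# Minimal student-teacher sketch (assumed hyperparameters, not the authors' setup).
import torch
import torch.nn as nn

torch.manual_seed(0)

d_in, m_teacher, m_student, d_out = 20, 10, 50, 1  # student is 5x wider (assumed)

def two_layer_relu(width):
    """A 2-layer ReLU network: input -> hidden (ReLU) -> scalar output."""
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, d_out))

teacher = two_layer_relu(m_teacher)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is fixed; only the student is trained

student = two_layer_relu(m_student)
opt = torch.optim.SGD(student.parameters(), lr=1e-2)

for step in range(5000):
    x = torch.randn(128, d_in)          # fresh i.i.d. inputs (assumed Gaussian)
    with torch.no_grad():
        y = teacher(x)                   # supervision comes from the teacher's output
    loss = nn.functional.mse_loss(student(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Near the interpolation regime the training loss (and per-sample gradients) are
# small; we can then check how well each teacher hidden node is "covered" by
# some student node, via cosine similarity of hidden activations.
with torch.no_grad():
    x = torch.randn(2048, d_in)
    h_t = torch.relu(teacher[0](x))      # teacher hidden activations, (N, m_teacher)
    h_s = torch.relu(student[0](x))      # student hidden activations, (N, m_student)
    sim = nn.functional.normalize(h_t.T, dim=1) @ nn.functional.normalize(h_s.T, dim=1).T
    print("best student match per teacher node:", sim.max(dim=1).values)
```

With enough student width, each teacher node typically ends up with at least one highly correlated student node, whereas a student with too few hidden units cannot cover all teacher nodes; this is the sense in which over-parameterization aids alignment in the sketch above.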
