Plateau Phenomenon in Gradient Descent Training of ReLU Networks: Explanation, Quantification and Avoidance

The ability of neural networks to provide `best in class' approximation across a wide range of applications is well documented. Nevertheless, the powerful expressivity of neural networks comes to naught if one is unable to effectively train (choose) the parameters defining the network. In general, neural networks are trained by gradient descent type optimization methods, or a stochastic variant thereof. In practice, such methods cause the loss function to decrease rapidly at the beginning of training but then, after a relatively small number of steps, to slow down significantly. The loss may even appear to stagnate over a large number of epochs, only to suddenly start decreasing rapidly again for no apparent reason. This so-called plateau phenomenon manifests itself in many learning tasks. The present work aims to identify and quantify the root causes of the plateau phenomenon. No assumptions are made on the number of neurons relative to the number of training data, and our results hold for both the lazy and adaptive regimes. The main findings are: plateaux correspond to periods during which activation patterns remain constant, where the activation pattern refers to the number of data points that activate a given neuron; quantification of the convergence of the gradient flow dynamics; and characterization of stationary points in terms of solutions of local least squares regression lines over subsets of the training data. Based on these conclusions, we propose a new iterative training method, Active Neuron Least Squares (ANLS), characterized by explicit adjustment of the activation pattern at each step, which is designed to enable a quick exit from a plateau. Illustrative numerical examples are included throughout.
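To make the activation-pattern notion concrete, the sketch below (not the paper's own code) trains a shallow ReLU network on a toy 1-D regression problem with plain gradient descent and records, at each step, which training points activate each neuron. A change in this pattern typically coincides with the end of a loss plateau. The network size, learning rate, data, and all variable names are illustrative assumptions.

```python
# Hedged sketch: monitor activation patterns of a shallow ReLU network
# f(x) = sum_j c_j * relu(w_j * x + b_j) during plain gradient descent.
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: n points on [0, 1] with a smooth target.
n = 50
x = np.linspace(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x)

# Shallow ReLU network with m neurons and parameters (w, b, c).
m = 20
w = rng.normal(size=m)
b = rng.normal(size=m)
c = rng.normal(size=m)

lr = 1e-2
prev_pattern = None
for step in range(5000):
    pre = np.outer(x, w) + b          # (n, m) pre-activations
    act = np.maximum(pre, 0.0)        # ReLU activations
    resid = act @ c - y               # (n,) residuals
    loss = 0.5 * np.mean(resid ** 2)

    # Activation pattern: boolean (n, m) matrix of which data points activate
    # which neuron (pattern.sum(axis=0) gives the per-neuron counts used in the
    # abstract).  A change in this pattern typically marks the end of a plateau.
    pattern = pre > 0.0
    if prev_pattern is not None and not np.array_equal(pattern, prev_pattern):
        print(f"step {step:5d}: activation pattern changed, loss = {loss:.4e}")
    prev_pattern = pattern

    # Gradients of the mean-squared loss with respect to (w, b, c).
    grad_c = act.T @ resid / n
    grad_w = (pattern * (resid[:, None] * c)).T @ x / n
    grad_b = (pattern * (resid[:, None] * c)).sum(axis=0) / n

    w -= lr * grad_w
    b -= lr * grad_b
    c -= lr * grad_c
```

Running the loop and logging the loss alongside the pattern-change messages shows long stretches of nearly constant loss punctuated by drops at (or shortly after) pattern changes, which is the behaviour the paper quantifies; this sketch does not implement ANLS itself.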
