Large-time asymptotics in deep learning

It is by now well known that practical deep supervised learning can roughly be cast as an optimal control problem for a specific discrete-time, nonlinear dynamical system called an artificial neural network. In this work, we consider the continuous-time formulation of the deep supervised learning problem and study its behavior as the final time horizon increases, which, in the neural network setting, can be interpreted as increasing the number of layers. For the classical regularized empirical risk minimization problem, we show that, in long time, the optimal states approach the zero training error regime, while the optimal control parameters approach, on an appropriate scale, minimal-norm parameters whose corresponding states lie precisely in the zero training error regime. This result provides an alternative theoretical underpinning, seen from the large-layer perspective, of the observation that neural networks learn best in the overparametrized regime. We also propose a learning problem consisting of minimizing a cost with a state tracking term, and establish the well-known turnpike property: the solutions of the learning problem on long time intervals consist of three pieces, the first and last of which are transient short-time arcs, while the middle piece is a long-time arc staying exponentially close to the optimal solution of an associated static learning problem. This property in fact yields a quantitative estimate of the number of layers required to reach the zero training error regime. Both of the aforementioned asymptotic regimes are addressed in the context of continuous-time and continuous space-time neural networks, the latter taking the form of nonlinear integro-differential equations, hence covering residual neural networks with both fixed and possibly variable depths.
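To fix notation, the continuous-time learning problem described above can be sketched as follows; this is an illustrative formulation only, and the specific activation $\sigma$, output map $P$, loss, and regularization weight $\lambda$ are assumptions of the sketch rather than the exact choices made in the paper. Each training sample $\vec{x}_i$ is propagated by a neural ODE over a horizon $[0,T]$, with $T$ playing the role of depth, and the parameters are obtained by regularized empirical risk minimization:

\[
\dot{x}_i(t) = \sigma\big(w(t)\,x_i(t) + b(t)\big), \qquad x_i(0) = \vec{x}_i, \qquad t \in (0,T), \quad i = 1,\dots,N,
\]
\[
\min_{w,\,b}\;\; \frac{1}{N}\sum_{i=1}^{N} \mathrm{loss}\big(P x_i(T), \vec{y}_i\big) \;+\; \lambda \int_0^T \big(\|w(t)\|^2 + \|b(t)\|^2\big)\,\mathrm{d}t .
\]

In this notation, the large-time asymptotics concern the optimal trajectories $x_i$ and parameters $(w,b)$ as $T \to \infty$. In the tracking variant, where the cost additionally penalizes the distance of the states to prescribed targets along the trajectory, the turnpike property states that, except near $t = 0$ and $t = T$, the optimal pair remains exponentially close to a minimizer of the associated static (steady-state) problem.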
