A Measure Theoretical Approach to the Mean-field Maximum Principle for Training NeurODEs

In this paper we consider a measure-theoretical formulation of the training of NeurODEs in the form of a mean-field optimal control with $L^2$-regularization of the control. We derive first-order optimality conditions for the NeurODE training problem in the form of a mean-field maximum principle, and show that it admits a unique control solution, which is Lipschitz continuous in time. As a consequence of this uniqueness property, the mean-field maximum principle also yields a strong quantitative generalization error for finite-sample approximations. Our derivation of the mean-field maximum principle is much simpler than the ones currently available in the literature for mean-field optimal control problems, and is based on a generalized Lagrange multiplier theorem on convex sets of spaces of measures. The latter result is also new and is of independent interest.
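For concreteness, the following display sketches the type of mean-field optimal control problem the abstract refers to. It is an illustrative reconstruction under standard assumptions, not a verbatim statement of the paper's formulation: the dynamics $\mathcal{F}$, the loss $\ell$, the regularization parameter $\lambda > 0$, and the initial data measure $\mu^0$ stand in for the objects defined precisely in the paper.
\[
\min_{\theta \in L^2([0,T];\mathbb{R}^m)} \; \int_{\mathbb{R}^d \times \mathbb{R}^d} \ell(x,y)\, \mathrm{d}\mu_T(x,y) \;+\; \lambda \int_0^T |\theta(t)|^2 \, \mathrm{d}t,
\]
subject to the continuity equation in the space of probability measures
\[
\partial_t \mu_t + \nabla_x \cdot \big( \mathcal{F}(t,x,\theta(t))\, \mu_t \big) = 0, \qquad \mu_{t=0} = \mu^0,
\]
where $\mu_t$ is the joint law of features $x$ and labels $y$ transported by the NeurODE dynamics $\dot{x}(t) = \mathcal{F}(t,x(t),\theta(t))$, with the label component left unchanged by the flow. In this notation, the mean-field maximum principle of the abstract couples the forward equation above with a backward adjoint dynamics and a pointwise-in-time optimality condition on $\theta$, from which the uniqueness and Lipschitz regularity of the optimal control are deduced.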
