A mean-field optimal control formulation of deep learning

Recent work linking deep neural networks and dynamical systems has opened up new avenues for analyzing deep learning. In particular, it has been observed that new insights can be obtained by recasting deep learning as an optimal control problem on difference or differential equations. However, the mathematical aspects of such a formulation have not been systematically explored. This paper formulates the population risk minimization problem in deep learning as a mean-field optimal control problem. Mirroring the development of classical optimal control, we state and prove optimality conditions of both the Hamilton–Jacobi–Bellman type and the Pontryagin type. These mean-field results reflect the probabilistic nature of the learning problem. In addition, by appealing to the mean-field Pontryagin's maximum principle, we establish quantitative relationships between the population and empirical learning problems. This work serves as a mathematical foundation for investigating the algorithmic and theoretical connections between optimal control and deep learning.
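
As a sketch of the formulation described above (notation assumed here rather than quoted from the paper: $\mu_0$ denotes the joint distribution of input–label pairs $(x_0, y_0)$, $f$ the feedforward dynamics, $\Phi$ the terminal loss, and $L$ a running regularization cost), the population risk minimization problem takes the mean-field optimal control form

\[
\inf_{\theta \in L^{\infty}([0,T],\,\Theta)} J(\theta) := \mathbb{E}_{(x_0, y_0) \sim \mu_0}\left[ \Phi(x_T, y_0) + \int_0^T L(x_t, \theta_t)\, dt \right],
\qquad \text{subject to } \dot{x}_t = f(x_t, \theta_t).
\]

The expectation over $\mu_0$ is what makes the problem mean-field: a single open-loop control $\theta$ must serve the entire population of inputs, so the optimality conditions necessarily involve the distribution of the state. In particular, writing the Hamiltonian as $H(x, p, \theta) = p \cdot f(x, \theta) - L(x, \theta)$, the Pontryagin-type condition replaces the classical pointwise Hamiltonian maximization with an averaged one,

\[
\mathbb{E}_{\mu_0}\, H(x_t^{*}, p_t^{*}, \theta_t^{*}) \;\geq\; \mathbb{E}_{\mu_0}\, H(x_t^{*}, p_t^{*}, \theta)
\quad \text{for all } \theta \in \Theta \text{ and a.e. } t \in [0,T],
\]

where the costate $p_t^{*}$ solves the adjoint equation $\dot{p}_t^{*} = -\nabla_x H(x_t^{*}, p_t^{*}, \theta_t^{*})$ with terminal condition $p_T^{*} = -\nabla_x \Phi(x_T^{*}, y_0)$.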
