Asymptotics of Reinforcement Learning with Neural Networks

We prove that a single-layer neural network trained with the Q-learning algorithm converges in distribution to a random ordinary differential equation as the size of the model and the number of training steps become large. Analysis of the limit differential equation shows that it has a unique stationary point, which solves the Bellman equation and therefore yields the optimal control for the problem. We also study the convergence of the limit differential equation to this stationary point. As a by-product of our analysis, we obtain the limiting behavior of single-layer neural networks trained on independent and identically distributed data by stochastic gradient descent under the widely used Xavier initialization.
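
To make the setting concrete, here is a minimal sketch (not taken from the paper) of a single-hidden-layer Q-network with Xavier-type 1/√N scaling of the output layer and one semi-gradient Q-learning update toward the Bellman target r + γ max_a' Q(x', a'). All symbols and constants (N, d, gamma, alpha, W, C, the sigmoid nonlinearity, and the environment interface) are illustrative assumptions rather than the paper's exact notation or architecture.

```python
import numpy as np

# Illustrative sketch: single-hidden-layer Q-network with Xavier-type
# 1/sqrt(N) output scaling, updated by stochastic semi-gradient Q-learning.
# Dimensions, step sizes, and the nonlinearity are assumptions for this example.

N = 1000          # number of hidden units
d = 4             # state dimension
n_actions = 2
gamma = 0.9       # discount factor
alpha = 1e-3      # learning rate

rng = np.random.default_rng(0)
W = rng.normal(size=(N, d))                        # hidden-layer weights
C = rng.normal(size=(n_actions, N)) / np.sqrt(N)   # output weights, Xavier-type scaling


def q_values(x):
    """Return Q(x, .) and the hidden activations for sigmoid hidden units."""
    h = 1.0 / (1.0 + np.exp(-W @ x))
    return C @ h, h


def q_learning_step(x, a, r, x_next):
    """One stochastic-gradient Q-learning update of (W, C) on a transition."""
    q, h = q_values(x)
    q_next, _ = q_values(x_next)
    target = r + gamma * np.max(q_next)            # Bellman target (held fixed in the gradient)
    delta = target - q[a]                          # temporal-difference error
    grad_W = np.outer(C[a] * h * (1.0 - h), x)     # dQ(x, a)/dW for sigmoid units
    C[a] += alpha * delta * h                      # dQ(x, a)/dC[a] = h
    W += alpha * delta * grad_W


# Example usage with a synthetic transition (x, a, r, x_next):
x, x_next = rng.normal(size=d), rng.normal(size=d)
q_learning_step(x, a=0, r=1.0, x_next=x_next)
```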
