Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic

Actor-critic (AC) algorithms, empowered by neural networks, have achieved significant empirical success in recent years. However, most of the existing theoretical support for AC algorithms focuses on linear function approximation or linearized neural networks, where the feature representation is fixed throughout training. This limitation fails to capture representation learning, a key aspect of neural AC that is pivotal in practical problems. In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC. Specifically, we consider a version of AC in which the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates: the critic is updated by temporal-difference (TD) learning with a larger stepsize, while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize. In the continuous-time and infinite-width limiting regime, when the timescales are properly separated, we prove that neural AC finds the globally optimal policy at a sublinear rate. Additionally, we prove that the feature representation induced by the critic network is allowed to deviate from its initialization while remaining within a neighborhood of it.
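To make the two-timescale structure concrete, here is a minimal sketch in Python/NumPy of the kind of update scheme the abstract describes: a two-layer critic trained by TD(0) with a larger stepsize and a two-layer actor nudged by a proximal (KL-regularized) policy step with a smaller stepsize. The toy MDP, network widths, stepsizes, and the particular form of the actor step are illustrative assumptions for this sketch, not the paper's exact construction or analysis.

```python
# Minimal two-timescale actor-critic sketch (assumed toy setup, not the paper's exact method).
import numpy as np

rng = np.random.default_rng(0)

# --- toy finite MDP (hypothetical): n_s states, n_a actions, random dynamics ---
n_s, n_a, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a] = next-state distribution
R = rng.uniform(size=(n_s, n_a))                   # rewards in [0, 1]

def phi(s, a):
    """One-hot state-action feature fed to both two-layer networks."""
    x = np.zeros(n_s * n_a)
    x[s * n_a + a] = 1.0
    return x

# --- overparameterized two-layer networks: f(x) = (1/m) * sum_i c_i * tanh(W_i . x) ---
m, d = 256, n_s * n_a                              # width m is the overparameterization

def init_net():
    return {"W": rng.normal(size=(m, d)), "c": rng.normal(size=m)}

def forward(net, x):
    h = np.tanh(net["W"] @ x)
    return net["c"] @ h / m, h

critic, actor = init_net(), init_net()

def policy(s):
    """Softmax (energy-based) policy induced by the actor network's outputs."""
    logits = np.array([forward(actor, phi(s, b))[0] for b in range(n_a)])
    p = np.exp(logits - logits.max())
    return p / p.sum()

eta_critic, eta_actor = 0.5, 0.01                  # two timescales: critic stepsize >> actor stepsize

for t in range(2000):
    s = rng.integers(n_s)
    a = rng.choice(n_a, p=policy(s))
    s_next = rng.choice(n_s, p=P[s, a])
    a_next = rng.choice(n_a, p=policy(s_next))
    x = phi(s, a)

    # Critic: TD(0) semi-gradient step on the Bellman error (larger stepsize).
    q, h = forward(critic, x)
    q_next, _ = forward(critic, phi(s_next, a_next))
    delta = R[s, a] + gamma * q_next - q
    critic["c"] += eta_critic * delta * h / m
    critic["W"] += eta_critic * delta * np.outer(critic["c"] * (1 - h**2), x) / m

    # Actor: proximal policy step (smaller stepsize). The KL-regularized update
    # pi_{t+1} ∝ pi_t * exp(alpha * Q_t) amounts to nudging the actor's logit at (s, a)
    # in the direction of the critic's advantage estimate; this is a crude stand-in
    # for the PPO update analyzed in the paper.
    q_all = np.array([forward(critic, phi(s, b))[0] for b in range(n_a)])
    adv = q - policy(s) @ q_all
    _, h_a = forward(actor, x)
    actor["c"] += eta_actor * adv * h_a / m
    actor["W"] += eta_actor * adv * np.outer(actor["c"] * (1 - h_a**2), x) / m
```

The separation eta_critic >> eta_actor mimics the two-timescale regime: the critic tracks the value function of the (slowly moving) current policy before the actor takes its proximal step.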
