Provably Efficient Exploration in Policy Optimization

While policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
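
The abstract describes two coupled updates: a proximal (KL-regularized) policy improvement step that exponentiates an optimistic action-value estimate, and a policy evaluation step that fits a linear action-value function by ridge regression and adds an exploration bonus. The sketch below illustrates that loop in NumPy; it is an illustration under simplifying assumptions (a finite action set, a known feature map `phi(s, a)`, and hypothetical environment methods `env.reset`, `env.step`, and `env.reveal_rewards`), not the paper's pseudocode, and the constants `alpha`, `beta`, and `lam` stand in for the theoretically prescribed step size, bonus multiplier, and ridge parameter.

```python
# Minimal sketch of an OPPO-style loop in a linear MDP (illustrative only).
# Assumptions (not from the abstract): finite action list `actions`, feature map
# `phi(s, a)` of dimension d, an episodic environment with horizon H that reveals
# the full per-step reward functions after each episode, and hand-picked
# constants alpha (policy step size), beta (bonus scale), lam (ridge parameter).

import numpy as np

def oppo_sketch(env, phi, actions, d, H, K, alpha=0.1, beta=1.0, lam=1.0):
    A = len(actions)
    hist = [[] for _ in range(H)]                    # hist[h]: visited (s, a, s') at step h
    Q = [lambda s: np.zeros(A) for _ in range(H)]    # optimistic Q estimates
    logits = [dict() for _ in range(H)]              # per-step policy log-weights

    def policy(h, s):
        """Softmax policy at step h; uniform at states never updated."""
        z = logits[h].get(s, np.zeros(A))
        p = np.exp(z - z.max())
        return p / p.sum()

    for k in range(K):
        # Policy improvement: mirror-descent step, pi_h proportional to
        # pi_h * exp(alpha * Q_h), applied lazily at the states we visit.
        s = env.reset()
        for h in range(H):
            logits[h][s] = logits[h].get(s, np.zeros(A)) + alpha * Q[h](s)
            a = int(np.random.choice(A, p=policy(h, s)))
            s_next = env.step(actions[a])            # hypothetical: returns next state
            hist[h].append((s, a, s_next))
            s = s_next
        reward = env.reveal_rewards()                # full-information adversarial rewards

        # Optimistic policy evaluation: backward regularized least squares on the
        # features, plus an elliptical-confidence ("UCB") exploration bonus.
        V_next = lambda s: 0.0                       # value beyond the last step is zero
        new_Q = [None] * H
        for h in reversed(range(H)):
            feats = [phi(s_h, actions[a_h]) for s_h, a_h, _ in hist[h]]
            targets = [V_next(s_hp1) for _, _, s_hp1 in hist[h]]
            Lam = lam * np.eye(d) + sum(np.outer(f, f) for f in feats)
            w = np.linalg.solve(Lam, sum(f * y for f, y in zip(feats, targets)))
            Lam_inv = np.linalg.inv(Lam)

            def Q_h(s, w=w, Lam_inv=Lam_inv, r_h=reward[h], h=h):
                q = np.empty(A)
                for i, act in enumerate(actions):
                    f = phi(s, act)
                    bonus = beta * np.sqrt(f @ Lam_inv @ f)
                    q[i] = min(r_h(s, act) + f @ w + bonus, H - h)   # truncate optimism
                return q

            V_next = lambda s, Q_h=Q_h, h=h: float(policy(h, s) @ Q_h(s))
            new_Q[h] = Q_h
        Q = new_Q
    return policy
```

The two design choices the code highlights are the ones the abstract names: optimism enters only through the bonus added during evaluation, and the improvement step is a standard PPO/mirror-descent update on the resulting optimistic action values.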
