Exploration-Enhanced POLITEX

We study algorithms for average-cost reinforcement learning problems with value function approximation. Our starting point is the recently proposed POLITEX algorithm, a version of policy iteration in which the policy produced in each iteration is near-optimal in hindsight for the sum of all past value function estimates. POLITEX has sublinear regret guarantees in uniformly-mixing MDPs when the value estimation error can be controlled, a condition that is satisfied if all policies sufficiently explore the environment. Unfortunately, this assumption is often unrealistic. Motivated by the rapid growth of interest in developing policies that learn to explore their environment in the absence of rewards (also known as no-reward learning), we replace the previous assumption that all policies explore the environment with the assumption that a single, sufficiently exploring policy is available beforehand. The main contribution of the paper is a modification of POLITEX that incorporates such an exploration policy in a way that allows us to obtain a regret guarantee similar to the previous one, but without requiring that all policies explore the environment. In addition to the novel theoretical guarantees, we demonstrate the benefits of our scheme on environments that are difficult to explore using simple schemes like dithering. While the solution we obtain may not achieve the best possible regret, it is the first result that shows how to control the regret in the presence of function approximation errors on problems where exploration is nontrivial. Our approach can also be seen as a way of reducing the problem of minimizing regret to that of learning a good exploration policy. We believe that modular approaches like ours can be highly beneficial in tackling harder control problems.
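For concreteness, the policy update described above can be sketched in tabular form: each iteration plays a softmax (exponential-weights) policy over the running sum of past action-value estimates, which is the expert-prediction view of POLITEX. The mixture with a fixed exploration policy shown below is only one plausible way to use such a policy and is an illustrative assumption, not necessarily the paper's exact scheme; the function names, the mixing weight alpha, and the tabular Q representation are likewise assumptions.

    import numpy as np

    def politex_policy(q_sum, eta):
        # q_sum: (num_states, num_actions) array holding the sum of past
        # action-value estimates, sum_{j<=k} Q_hat_j(s, a);
        # eta: inverse-temperature (step-size) parameter of the softmax.
        logits = eta * q_sum
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum(axis=1, keepdims=True)

    def behaviour_policy(pi_politex, pi_explore, alpha):
        # Hypothetical mixture: with probability alpha act according to a fixed,
        # sufficiently exploring policy pi_explore, otherwise follow the current
        # POLITEX policy. Both arguments are (num_states, num_actions) arrays of
        # action probabilities.
        return (1.0 - alpha) * pi_politex + alpha * pi_explore

Under this sketch, the data gathered while acting can be fed to any average-cost value estimator (e.g., least-squares temporal-difference methods) to produce the next Q_hat, which is then added to q_sum before the next call to politex_policy.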
