Policy Gradient Reinforcement Learning Without Regret

This thesis consists of two independent projects, each contributing to a central goal of artificial intelligence research: to build computer systems that are capable of performing tasks and solving problems without problem-specific direction from us, their designers. I focus on two formal learning problems that have a strong mathematical grounding. Many real-world learning problems can be cast as instances of one of these two problems. Whenever our translation from the real to the formal accurately captures the character of the problem, the mathematical arguments we make about algorithms in the formal setting will approximately hold in the real world as well. The first project focuses on an open question in the theory of policy gradient reinforcement learning methods. These methods learn by trial and error and decide whether a trial was good or bad by comparing its outcome to a given baseline. The baseline has no impact on the formal asymptotic guarantees for policy gradient methods, but it does alter their finite-time behaviour. This immediately raises the question: which baseline should we use? I propose that the baseline should be chosen to minimize the error of a certain estimate used internally by policy gradient methods. I prove that, under mildly idealized assumptions, this baseline gives a good upper bound on the regret of policy gradient methods. I derive closed-form expressions for this baseline in terms of properties of the formal learning problem and the computer’s behaviour. The quantities appearing in the closed-form expressions are often unknown, so I also propose two algorithms for estimating this baseline using only known quantities. Finally, I present an empirical comparison of commonly used baselines that demonstrates improved performance when using my proposed baseline.
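The role of the baseline can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the construction used in the thesis: it takes a three-action bandit with deterministic rewards and a softmax policy (both hypothetical choices), and computes by exact enumeration the mean and total variance of the score-function gradient estimate g = (r(a) - b) * grad log pi(a). It uses the standard minimum-variance baseline b* = E[r ||grad log pi||^2] / E[||grad log pi||^2] for this estimator, which may differ from the closed-form expressions derived in the thesis. All function names are made up for the example.

```python
import math

def softmax(theta):
    """Softmax policy over actions (numerically stabilized)."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def score(theta, a):
    """Gradient of log pi(a) with respect to theta: e_a - pi."""
    pi = softmax(theta)
    return [(1.0 if i == a else 0.0) - pi[i] for i in range(len(theta))]

def gradient_moments(theta, rewards, baseline):
    """Exact mean and total variance of g = (r(a) - b) * score(a), a ~ pi."""
    pi = softmax(theta)
    n = len(theta)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        g = [(rewards[a] - baseline) * s for s in score(theta, a)]
        for i in range(n):
            mean[i] += pi[a] * g[i]
        second_moment += pi[a] * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

def min_variance_baseline(theta, rewards):
    """Baseline minimizing E||g||^2 (assumed estimator; see lead-in)."""
    pi = softmax(theta)
    num = 0.0
    den = 0.0
    for a in range(len(theta)):
        w = sum(x * x for x in score(theta, a))  # squared score norm
        num += pi[a] * rewards[a] * w
        den += pi[a] * w
    return num / den

# Hypothetical problem instance.
theta = [0.5, -0.2, 0.1]
rewards = [1.0, 3.0, 2.0]

b_star = min_variance_baseline(theta, rewards)
b_avg = sum(p * r for p, r in zip(softmax(theta), rewards))  # expected reward

mean_zero, var_zero = gradient_moments(theta, rewards, 0.0)
mean_avg, var_avg = gradient_moments(theta, rewards, b_avg)
mean_star, var_star = gradient_moments(theta, rewards, b_star)

print("baseline 0:      variance", var_zero)
print("baseline E[r]:   variance", var_avg)
print("baseline b*:     variance", var_star)
```

In this toy setting the mean of the estimate is the same for every baseline, because the expected score is zero, while the variance is a quadratic function of b. This is one concrete sense in which the baseline leaves asymptotic guarantees untouched but changes finite-time behaviour.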
The second project focuses on a recently proposed class of formal learning problems that lies at the intersection of two fields of computing science research: reinforcement learning and online learning. The considered problems are sometimes called online Markov decision processes, or Markov decision processes with changing rewards. The unique property of this class is that it assumes the computer’s environment is adversarial, as though it were playing a game against the computer. This is in contrast to the more common assumption that the environment’s behaviour is determined entirely by stochastic models. I propose three new algorithms for learning in Markov decision processes with changing rewards under various conditions. I prove theoretical performance guarantees for each algorithm that either complement or improve the best existing results and that often hold even under weaker assumptions. This comes at the cost of increased (but still polynomial) computational complexity. Finally, in developing and analyzing these algorithms, it was necessary to analyze an approximate version of a well-known optimization algorithm called online mirror ascent. To the best of my knowledge, this is the first rigorous analysis of this algorithm, and it is of independent interest.
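For readers unfamiliar with online mirror ascent, the following is a minimal sketch of the exact (not approximate) algorithm in its simplest common form. It assumes the decision set is the probability simplex and the regularizer is negative entropy, in which case the update reduces to the multiplicative-weights rule. The adversarial reward sequence, step size, and function names are hypothetical choices for illustration. The thesis's setting (online Markov decision processes) and its approximate variant are more involved than this.

```python
import math

def mirror_ascent_step(weights, reward_vector, step_size):
    """One entropic mirror ascent step on the simplex.

    With a negative-entropy regularizer, the update multiplies each
    weight by exp(step_size * reward) and renormalizes.
    """
    unnormalized = [w * math.exp(step_size * r)
                    for w, r in zip(weights, reward_vector)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def run_online_mirror_ascent(reward_sequence, step_size):
    """Play against a fixed reward sequence; return final weights and total reward."""
    n = len(reward_sequence[0])
    weights = [1.0 / n] * n  # start from the uniform distribution
    gained = 0.0
    for rewards in reward_sequence:
        # The learner commits to `weights` before seeing `rewards`.
        gained += sum(w * r for w, r in zip(weights, rewards))
        weights = mirror_ascent_step(weights, rewards, step_size)
    return weights, gained

# Hypothetical reward sequence with rewards in [0, 1] that changes every round.
T = 50
reward_sequence = [
    [1.0, 0.6, 0.0] if t % 2 == 0 else [0.0, 0.6, 0.2]
    for t in range(T)
]
step_size = 0.1

final_weights, gained = run_online_mirror_ascent(reward_sequence, step_size)

# Regret relative to the best fixed action in hindsight.
n_actions = len(reward_sequence[0])
totals = [sum(r[a] for r in reward_sequence) for a in range(n_actions)]
regret = max(totals) - gained
print("final weights:", final_weights)
print("regret:", regret)
```

The regret printed at the end compares the learner's total reward with that of the best single action chosen in hindsight, which is the kind of performance measure used when the environment is treated as adversarial.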
