Reward-Weighted Regression Converges to a Global Optimum

Reward-Weighted Regression (RWR) belongs to a widely known family of iterative reinforcement learning algorithms based on the Expectation-Maximization framework. In this family, each iteration samples a batch of trajectories using the current policy and fits a new policy that maximizes a return-weighted log-likelihood of the observed actions. Although RWR is known to yield monotonic policy improvement under certain circumstances, whether, and under which conditions, RWR converges to the optimal policy has remained an open question. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case of finite state and action spaces, we prove R-linear convergence of the state-value function to the optimum.
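
The snippet below is a minimal illustrative sketch of the iteration described above in the exact (tabular, no function approximation) case, not the paper's own construction. It assumes a small randomly generated MDP with strictly positive rewards, and it uses the common RWR-style update in which the current policy is reweighted by its own action-value function and renormalized per state; the helper names `evaluate_policy` and `rwr_step` are hypothetical.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma, tol=1e-10):
    """Exact tabular policy evaluation: fixed-point iteration for Q(s, a)."""
    S, A = R.shape
    Q = np.zeros((S, A))
    while True:
        V = np.einsum("sa,sa->s", pi, Q)              # V(s) = sum_a pi(a|s) Q(s,a)
        Q_new = R + gamma * (P @ V)                   # Q(s,a) = R(s,a) + gamma * E[V(s')]
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

def rwr_step(P, R, pi, gamma):
    """One exact RWR iteration: reweight pi by its Q-values, renormalize per state."""
    Q = evaluate_policy(P, R, pi, gamma)
    new_pi = pi * Q                                   # return-weighted reweighting (Q > 0 here)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, gamma = 4, 3, 0.9
    P = rng.dirichlet(np.ones(S), size=(S, A))        # P[s, a, s'] transition kernel
    R = rng.uniform(0.1, 1.0, size=(S, A))            # strictly positive rewards
    pi = np.full((S, A), 1.0 / A)                     # start from the uniform policy
    for _ in range(200):
        pi = rwr_step(P, R, pi, gamma)
    V = np.einsum("sa,sa->s", pi, evaluate_policy(P, R, pi, gamma))
    print("policy after 200 RWR iterations:\n", np.round(pi, 3))
    print("state values:", np.round(V, 3))
```

On this toy instance the iterates concentrate probability mass on the best action in each state, which is the behavior the convergence result above formalizes for the exact tabular setting.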
