Trust Region Policy Optimization for POMDPs

We propose Generalized Trust Region Policy Optimization (GTRPO), a reinforcement learning algorithm that extends Trust Region Policy Optimization (TRPO) to Partially Observable Markov Decision Processes (POMDPs). While policy gradient methods in principle require no model assumptions, previous studies of more sophisticated policy gradient methods have mainly been limited to MDPs. Many real-world decision-making tasks, however, are inherently non-Markovian: only an incomplete representation of the environment is observable. Moreover, most advanced policy gradient methods are designed for infinite-horizon MDPs. GTRPO is a policy gradient method for continuous episodic POMDPs. We prove that its policy updates monotonically improve the expected cumulative return. We empirically study GTRPO on several RoboSchool environments, an extension of the MuJoCo environments, and provide insights into its empirical behavior.
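
For context on the monotonic-improvement claim, the guarantee in the fully observable MDP setting (Schulman et al., 2015) is a lower bound on the true return of a candidate policy in terms of a surrogate objective and a KL divergence penalty; maximizing the right-hand side at each update then cannot decrease the return. GTRPO's result is the POMDP analogue of this kind of bound, which we do not restate here; the sketch below is the standard MDP-setting bound only.

\[
\eta(\tilde\pi) \;\ge\; L_{\pi}(\tilde\pi) \;-\; C \, D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi),
\qquad
C = \frac{4 \epsilon \gamma}{(1-\gamma)^2},
\]
\[
\text{where } L_{\pi}(\tilde\pi) = \eta(\pi) + \mathbb{E}_{s \sim \rho_{\pi},\, a \sim \tilde\pi}\!\big[A_{\pi}(s,a)\big],
\qquad
\epsilon = \max_{s,a} \big|A_{\pi}(s,a)\big|.
\]

Here \(\eta\) is the expected cumulative return, \(\rho_{\pi}\) the discounted state visitation distribution, and \(A_{\pi}\) the advantage function of the current policy \(\pi\).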
