Robust Reinforcement Learning for Continuous Control with Model Misspecification

We provide a framework for incorporating robustness to perturbations in the transition dynamics, which we refer to as model misspecification, into continuous control Reinforcement Learning (RL) algorithms. We focus specifically on incorporating robustness into a state-of-the-art continuous control RL algorithm, Maximum a-posteriori Policy Optimization (MPO). We achieve this by learning a policy that optimizes a worst-case expected return objective, and we derive a corresponding robust entropy-regularized Bellman contraction operator. In addition, we introduce a less conservative, soft-robust, entropy-regularized objective with a corresponding Bellman operator. We show that both robust and soft-robust policies outperform their non-robust counterparts in nine MuJoCo domains with environment perturbations, and that they yield improved robust performance on a high-dimensional, simulated, dexterous robotic hand. Finally, we present multiple investigative experiments that provide deeper insight into the robustness framework, including an adaptation to another continuous control RL algorithm and learning the uncertainty set from offline data. Performance videos can be found online at this https URL.

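To make the two objectives concrete, the following is a minimal sketch of the robust and soft-robust entropy-regularized Bellman backups in the spirit of the abstract, assuming a finite uncertainty set $\mathcal{P}(s,a)$ of transition models, a distribution $\mu$ over that set for the soft-robust case, a discount factor $\gamma$, and an entropy temperature $\alpha$; the exact regularized forms derived in the paper may differ:

\[
(\mathcal{T}_{\mathrm{robust}} V)(s) = \max_{\pi(\cdot \mid s)} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ r(s,a) - \alpha \log \pi(a \mid s) + \gamma \min_{p \in \mathcal{P}(s,a)} \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\!\left[ V(s') \right] \right]
\]

\[
(\mathcal{T}_{\mathrm{soft\text{-}robust}} V)(s) = \max_{\pi(\cdot \mid s)} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ r(s,a) - \alpha \log \pi(a \mid s) + \gamma \, \mathbb{E}_{p \sim \mu} \, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}\!\left[ V(s') \right] \right]
\]

The only difference is how the uncertainty set is aggregated: the robust operator backs up the worst-case model in $\mathcal{P}(s,a)$, whereas the soft-robust operator averages over models under $\mu$, which is what makes it less conservative.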