Maximum Entropy RL (Provably) Solves Some Robust RL Problems

Many potential applications of reinforcement learning (RL) require guarantees that the agent will perform well in the face of disturbances to the dynamics or the reward function. In this paper, we prove theoretically that standard maximum entropy (MaxEnt) RL is robust to some disturbances in the dynamics and the reward function. While this capability of MaxEnt RL has been observed empirically in prior work, to the best of our knowledge, our work provides the first rigorous proof and theoretical characterization of the MaxEnt RL robust set. Although a number of prior robust RL algorithms have been designed to handle similar disturbances to the reward function or dynamics, these methods typically require additional moving parts and hyperparameters on top of a base RL algorithm. In contrast, our theoretical results suggest that MaxEnt RL by itself is robust to certain disturbances, without requiring any additional modifications. While this does not imply that MaxEnt RL is the best available robust RL method, it does possess striking simplicity and appealing formal guarantees.
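For concreteness, the standard MaxEnt RL objective augments the usual expected return with the policy's entropy at every step; the notation below is the generic textbook form (temperature \(\alpha\) and discount \(\gamma\) are standard symbols, not values taken from this abstract):

\[
J_{\mathrm{MaxEnt}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\Big( r(s_t, a_t) \;+\; \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right],
\]

where \(\mathcal{H}\) denotes the entropy of the policy at state \(s_t\). The claim of the paper is that optimizing this single objective, with no added robustness machinery, already yields robustness to a characterizable set of reward and dynamics disturbances.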
