Policy Gradient Bayesian Robust Optimization for Imitation Learning

The difficulty of specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents uncertain about what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective balancing expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses that can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function.
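To make the soft-robust objective concrete, the sketch below assumes it takes the common form of a convex combination of expected return and conditional value at risk (CVaR) over a discrete posterior of reward hypotheses, i.e. lambda * E[rho] + (1 - lambda) * CVaR_alpha[rho]; the variable names and the Rockafellar-Uryasev CVaR computation are illustrative assumptions, not taken verbatim from the paper.

```python
# Minimal sketch of a soft-robust objective over a posterior of reward
# hypotheses (assumed form: lambda * expectation + (1 - lambda) * CVaR).
import numpy as np

def cvar(returns, weights, alpha):
    """CVaR of the return distribution at level alpha, computed with the
    Rockafellar-Uryasev formulation over a discrete posterior.
    For a discrete distribution the maximizing threshold sigma is one of
    the observed returns, so we search over those atoms."""
    values = [s - np.sum(weights * np.maximum(s - returns, 0.0)) / (1.0 - alpha)
              for s in returns]
    return max(values)

def soft_robust_objective(returns, weights, alpha=0.95, lam=0.5):
    """Blend of expected return and CVaR over posterior reward hypotheses.
    lam = 1 recovers risk-neutral expected performance; lam = 0 is fully
    risk-averse with respect to reward-function uncertainty."""
    expected = np.sum(weights * returns)
    return lam * expected + (1.0 - lam) * cvar(returns, weights, alpha)

# Example: one policy's expected returns under 5 sampled reward hypotheses
# with uniform posterior weights.
returns = np.array([1.0, 0.8, 0.5, -0.2, 0.9])
weights = np.full(5, 0.2)
print(soft_robust_objective(returns, weights, alpha=0.95, lam=0.5))  # 0.2
```

In a policy-gradient setting, this scalar objective would be differentiated with respect to the policy parameters through the per-hypothesis returns, so the gradient weights each reward hypothesis by how much it contributes to the expectation and to the worst-case tail.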
