Policy Gradient Bayesian Robust Optimization for Imitation Learning

The difficulty of specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents uncertain about what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective balancing expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses that can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function.
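To make the soft-robust objective concrete, the sketch below assumes it takes the common form of a convex combination of expected return and conditional value at risk (CVaR) over a discrete posterior of reward hypotheses, i.e. lambda * E[rho] + (1 - lambda) * CVaR_alpha[rho]; the variable names and the Rockafellar-Uryasev CVaR computation are illustrative assumptions, not taken verbatim from the paper.

```python
# Minimal sketch of a soft-robust objective over a posterior of reward
# hypotheses (assumed form: lambda * expectation + (1 - lambda) * CVaR).
import numpy as np

def cvar(returns, weights, alpha):
    """CVaR of the return distribution at level alpha, computed with the
    Rockafellar-Uryasev formulation over a discrete posterior.
    For a discrete distribution the maximizing threshold sigma is one of
    the observed returns, so we search over those atoms."""
    values = [s - np.sum(weights * np.maximum(s - returns, 0.0)) / (1.0 - alpha)
              for s in returns]
    return max(values)

def soft_robust_objective(returns, weights, alpha=0.95, lam=0.5):
    """Blend of expected return and CVaR over posterior reward hypotheses.
    lam = 1 recovers risk-neutral expected performance; lam = 0 is fully
    risk-averse with respect to reward-function uncertainty."""
    expected = np.sum(weights * returns)
    return lam * expected + (1.0 - lam) * cvar(returns, weights, alpha)

# Example: one policy's expected returns under 5 sampled reward hypotheses
# with uniform posterior weights.
returns = np.array([1.0, 0.8, 0.5, -0.2, 0.9])
weights = np.full(5, 0.2)
print(soft_robust_objective(returns, weights, alpha=0.95, lam=0.5))  # 0.2
```

In a policy-gradient setting, this scalar objective would be differentiated with respect to the policy parameters through the per-hypothesis returns, so the gradient weights each reward hypothesis by how much it contributes to the expectation and to the worst-case tail.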
