Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning

In the field of reinforcement learning, there has been recent progress toward safety guarantees and high-confidence bounds on policy performance. To our knowledge, however, no practical methods exist for determining high-confidence bounds on policy performance in the inverse reinforcement learning setting, where the true reward function is unknown and only samples of expert behavior are given. We propose a sampling method based on Bayesian inverse reinforcement learning that uses demonstrations to determine practical high-confidence upper bounds on the $\alpha$-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert's unknown reward function. We evaluate the proposed bound on a standard grid-navigation task and a simulated driving task, and we achieve tighter and more accurate bounds than a feature-count-based baseline. We also give examples of how the bound can be used to perform risk-aware policy selection and risk-aware policy improvement. Because the proposed bound requires several orders of magnitude fewer demonstrations than existing high-confidence bounds, it is the first practical method that allows agents that learn from demonstration to express confidence in the quality of their learned policy.
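
As a minimal sketch of this construction (not the authors' implementation), the Python code below assumes that reward samples have already been drawn from a Bayesian IRL posterior via MCMC, and that two hypothetical helpers exist: `solve_mdp(r)`, which returns an optimal policy for reward `r`, and `policy_return(pi, r)`, which computes a policy's expected return under `r`. It then reports the empirical $\alpha$-quantile ($\alpha$-VaR) of the expected value difference (EVD) between the evaluation policy and the optimal policy across the posterior samples.

```python
import numpy as np

def evd_alpha_var(posterior_rewards, eval_policy, solve_mdp, policy_return,
                  alpha=0.95):
    """Sketch of an alpha-worst-case performance bound for eval_policy.

    posterior_rewards : reward functions sampled (e.g., via MCMC) from a
                        Bayesian IRL posterior conditioned on demonstrations
    eval_policy       : the policy whose performance we want to bound
    solve_mdp         : hypothetical helper returning an optimal policy
                        for a given reward function
    policy_return     : hypothetical helper returning the expected return
                        of a policy under a given reward function
    alpha             : quantile level of the bound, e.g., 0.95
    """
    evds = []
    for r in posterior_rewards:
        pi_star = solve_mdp(r)  # optimal behavior if r were the true reward
        # Expected value difference under this posterior sample; the gap is
        # non-negative because pi_star is optimal for r.
        evds.append(policy_return(pi_star, r) - policy_return(eval_policy, r))
    # Empirical alpha-quantile of the EVD samples: the evaluation policy's
    # loss relative to the expert-optimal policy exceeds this value on at
    # most a (1 - alpha) fraction of the posterior mass.
    return float(np.quantile(evds, alpha))
```

With a few thousand posterior samples and, say, $\alpha = 0.95$, the returned value estimates the 0.95-quantile of the evaluation policy's performance gap under the posterior over reward functions; the same quantity can then drive risk-aware policy selection (pick the candidate with the smallest bound) or risk-aware policy improvement.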
