Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning

In the field of reinforcement learning, there has been recent progress toward safety guarantees and high-confidence bounds on policy performance. To our knowledge, however, no practical methods exist for determining high-confidence bounds on policy performance in the inverse reinforcement learning setting, where the true reward function is unknown and only samples of expert behavior are given. We propose a sampling method based on Bayesian inverse reinforcement learning that uses demonstrations to determine practical high-confidence upper bounds on the $\alpha$-worst-case difference in expected return between any evaluation policy and the optimal policy under the expert's unknown reward function. We evaluate the proposed bound on a standard grid-navigation task and a simulated driving task, and we achieve tighter and more accurate bounds than a feature-count-based baseline. We also give examples of how the bound can be used to perform risk-aware policy selection and risk-aware policy improvement. Because the proposed bound requires several orders of magnitude fewer demonstrations than existing high-confidence bounds, it is the first practical method that allows agents that learn from demonstration to express confidence in the quality of their learned policy.
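
As a minimal sketch of this construction (not the authors' implementation), the Python code below assumes that reward samples have already been drawn from a Bayesian IRL posterior via MCMC, and that two hypothetical helpers exist: `solve_mdp(r)`, which returns an optimal policy for reward `r`, and `policy_return(pi, r)`, which computes a policy's expected return under `r`. It then reports the empirical $\alpha$-quantile ($\alpha$-VaR) of the expected value difference (EVD) between the evaluation policy and the optimal policy across the posterior samples.

```python
import numpy as np

def evd_alpha_var(posterior_rewards, eval_policy, solve_mdp, policy_return,
                  alpha=0.95):
    """Sketch of an alpha-worst-case performance bound for eval_policy.

    posterior_rewards : reward functions sampled (e.g., via MCMC) from a
                        Bayesian IRL posterior conditioned on demonstrations
    eval_policy       : the policy whose performance we want to bound
    solve_mdp         : hypothetical helper returning an optimal policy
                        for a given reward function
    policy_return     : hypothetical helper returning the expected return
                        of a policy under a given reward function
    alpha             : quantile level of the bound, e.g., 0.95
    """
    evds = []
    for r in posterior_rewards:
        pi_star = solve_mdp(r)  # optimal behavior if r were the true reward
        # Expected value difference under this posterior sample; the gap is
        # non-negative because pi_star is optimal for r.
        evds.append(policy_return(pi_star, r) - policy_return(eval_policy, r))
    # Empirical alpha-quantile of the EVD samples: the evaluation policy's
    # loss relative to the expert-optimal policy exceeds this value on at
    # most a (1 - alpha) fraction of the posterior mass.
    return float(np.quantile(evds, alpha))
```

With a few thousand posterior samples and, say, $\alpha = 0.95$, the returned value estimates the 0.95-quantile of the evaluation policy's performance gap under the posterior over reward functions; the same quantity can then drive risk-aware policy selection (pick the candidate with the smallest bound) or risk-aware policy improvement.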
