Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes

As it achieves a goal on behalf of its human user, an autonomous agent’s actions may have side effects that change features of its environment in ways that negatively surprise the user. An agent that can be trusted to operate safely should thus change only features the user has explicitly permitted. We formalize this problem and develop a planning algorithm that avoids potentially negative side effects given what the agent knows about which features are (un)changeable. Further, we formulate a provably minimax-regret querying strategy with which the agent selectively asks the user about features it has not explicitly been told about. We empirically show that this querying strategy is substantially faster than a more exhaustive approach and that its queries outperform those found by the best known heuristic.
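
The abstract describes the minimax-regret querying criterion without pseudocode. Below is a minimal brute-force sketch of that criterion, not the paper’s algorithm: everything in it is assumed for illustration. Features are plain labels, the hypothetical `value_of_best_safe_policy(free)` stands in for a safe planner that returns the value of the best policy changing only the features in `free`, unqueried unknown features are conservatively treated as unchangeable, and the user is assumed to answer truthfully.

```python
from itertools import combinations


def powerset(items):
    """All subsets of `items`, as sets."""
    items = list(items)
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def minimax_regret_query(known_free, unknown, k, value_of_best_safe_policy):
    """Pick k unknown features to ask about, minimizing worst-case regret.

    `value_of_best_safe_policy(free)` is a stand-in for a safe planner:
    the value of the best policy that changes only the features in `free`.
    """
    best_query, best_worst = None, float("inf")
    for query in combinations(sorted(unknown), k):
        qset = set(query)
        # Worst case over every possible truth about the unknown features:
        # regret = value had the agent known the full truth, minus the value
        # it gets after learning only which *queried* features are free
        # (unqueried unknowns stay treated as unchangeable).
        worst = max(
            value_of_best_safe_policy(known_free | true_free)
            - value_of_best_safe_policy(known_free | (true_free & qset))
            for true_free in powerset(unknown)
        )
        if worst < best_worst:
            best_query, best_worst = qset, worst
    return best_query, best_worst


# Toy usage: feature weights model how much changing each feature helps.
weights = {"door": 2.0, "vase": 0.0, "rug": 1.0, "cat": 5.0}


def value(free):
    return sum(weights[f] for f in free)


query, regret = minimax_regret_query(
    known_free={"door"}, unknown={"vase", "rug", "cat"},
    k=1, value_of_best_safe_policy=value)
# -> query == {"cat"}, regret == 1.0: ask about the feature whose
#    permission matters most to the achievable value.
```

Note that this sketch enumerates all 2^|unknown| possible truths for every candidate query, which is the kind of exhaustive computation the abstract’s empirical comparison is measured against; per the abstract, the paper’s strategy is much faster while still provably minimizing maximum regret.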
