Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration

Autonomous cyber-physical agents and systems play an increasingly large role in our lives. To ensure that agents behave in ways aligned with the values of the societies in which they operate, we must develop techniques that allow these agents not only to maximize their reward in an environment, but also to learn and follow the implicit constraints of society. These constraints and norms can come from any number of sources, including regulations, business process guidelines, laws, ethical principles, social norms, and moral values. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations of the task, and reinforcement learning to learn to maximize environment rewards. More precisely, we assume that an agent can observe traces of behavior of members of the society but has no access to the explicit set of constraints that give rise to the observed behavior. Inverse reinforcement learning is used to learn such constraints, which are then combined with a possibly orthogonal value function through a contextual-bandit-based orchestrator that makes a contextually appropriate choice between the two policies (constraint-based and environment reward-based) when taking actions. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either the reward-maximizing or the constrained policy. In addition, the orchestrator is transparent about which policy is being employed at each time step. We test our algorithms in a Pac-Man domain and show that the agent is able to learn to act optimally, to act within the demonstrated constraints, and to mix these two behaviors in complex ways.
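The abstract does not spell out the orchestrator's internals, so the following is only a minimal sketch of the policy-orchestration idea under stated assumptions: a two-armed contextual bandit using linear Thompson sampling decides at each time step whether the reward-maximizing policy or the constraint-based policy acts. The class name BanditOrchestrator, the stand-in policies reward_policy and constraint_policy, the featurize helper, and all hyperparameters are illustrative assumptions, not the paper's API or exact algorithm.

```python
# Illustrative sketch, not the paper's implementation: a linear
# Thompson-sampling contextual bandit with one arm per policy.
import numpy as np

class BanditOrchestrator:
    def __init__(self, context_dim, n_arms=2, v=0.1):
        self.v = v  # exploration scale for Thompson sampling (assumed value)
        # One Bayesian linear payoff model per arm (policy).
        self.B = [np.eye(context_dim) for _ in range(n_arms)]
        self.f = [np.zeros(context_dim) for _ in range(n_arms)]

    def choose(self, context):
        """Sample a parameter vector per arm and return the arm whose
        sampled payoff estimate is highest for this context."""
        scores = []
        for B, f in zip(self.B, self.f):
            B_inv = np.linalg.inv(B)
            mu = B_inv @ f
            theta = np.random.multivariate_normal(mu, self.v ** 2 * B_inv)
            scores.append(context @ theta)
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        """Standard linear-bandit posterior update for the chosen arm."""
        self.B[arm] += np.outer(context, context)
        self.f[arm] += reward * context

# Hypothetical usage: env.reset/env.step, featurize, and the two policies
# are assumptions standing in for the agent's actual components.
#
# orchestrator = BanditOrchestrator(context_dim=4)
# policies = [reward_policy, constraint_policy]
# state = env.reset()
# context = featurize(state)
# arm = orchestrator.choose(context)      # which policy acts this step
# action = policies[arm](state)
# next_state, reward, done = env.step(action)
# orchestrator.update(arm, context, reward)
```

Because the chosen arm is explicit at every step, logging it directly gives the per-time-step transparency about which policy is being employed that the abstract describes.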
