Deep Reinforcement Learning from Human Preferences

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
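As a rough illustration of the core idea (not the paper's exact implementation), the sketch below fits a reward predictor to human preferences between two trajectory segments: under a Bradley-Terry style model, the probability that the first segment is preferred is the softmax of the summed predicted rewards, and the predictor is trained with cross-entropy against the human label. The network architecture, optimizer, segment length, and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Illustrative reward predictor r_hat(s, a) -> scalar; the architecture is an assumption."""
    def __init__(self, obs_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment):
        # segment: (T, obs_act_dim) -> summed predicted reward over the segment
        return self.net(segment).sum()

def preference_loss(reward_model, seg1, seg2, human_label):
    """Cross-entropy loss on P[seg1 preferred over seg2].

    human_label is 1.0 if the human preferred seg1, 0.0 if seg2
    (0.5 can encode indifference)."""
    r1 = reward_model(seg1)
    r2 = reward_model(seg2)
    # P[seg1 preferred] = exp(r1) / (exp(r1) + exp(r2))
    log_probs = torch.log_softmax(torch.stack([r1, r2]), dim=0)
    return -(human_label * log_probs[0] + (1.0 - human_label) * log_probs[1])

# Hypothetical usage: one gradient step on a single labeled comparison.
if __name__ == "__main__":
    model = RewardModel(obs_act_dim=8)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    seg1, seg2 = torch.randn(25, 8), torch.randn(25, 8)  # two 25-step segments
    loss = preference_loss(model, seg1, seg2, human_label=torch.tensor(1.0))
    opt.zero_grad(); loss.backward(); opt.step()
```

The learned reward predictor would then stand in for the environment's reward when training the RL agent, with fresh preference queries collected as the policy's behavior changes.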
