Safe Policy Learning from Observations

In this paper, we consider the problem of learning a policy by observing numerous non-expert agents. Our goal is to extract a policy that, with high confidence, performs better than the agents' average performance. Such a setting is important for real-world problems where expert data is scarce but non-expert data can easily be obtained, e.g., by crowdsourcing. Our approach is to pose this problem as safe policy improvement in reinforcement learning. First, we estimate an average behavior policy and approximate its value function. Then, we develop a stochastic policy improvement algorithm, termed Rerouted Behavior Improvement (RBI), that safely improves upon the average behavior. The primary advantages of RBI over current safe learning methods are its stability in the presence of value estimation errors and the elimination of a policy search process. We demonstrate these advantages in a Taxi grid-world domain and in four games from the Atari Learning Environment.
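To make the two-stage pipeline concrete, the sketch below gives a minimal tabular illustration of the idea: first estimate the average behavior policy from observed state-action counts, then shift probability mass toward actions with higher estimated value while capping how far the new policy may move from the behavior estimate. The function names, the counts/q_values inputs, and the c_max cap are illustrative assumptions; this is not the paper's exact RBI update, only the general safe-improvement pattern the abstract describes.

```python
import numpy as np

def estimate_behavior_policy(counts):
    """Maximum-likelihood estimate of the average behavior policy
    beta(a|s) from observed state-action visit counts (S x A array)."""
    totals = counts.sum(axis=1, keepdims=True)
    n_actions = counts.shape[1]
    # Unvisited states fall back to a uniform policy.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_actions)

def constrained_improvement(beta, q_values, c_max=2.0):
    """Greedily move probability toward higher-value actions, but cap each
    action's new probability at c_max * beta(a|s) so the improved policy
    stays close to the estimated behavior (the removed mass is "rerouted")."""
    pi = np.zeros_like(beta)
    for s in range(beta.shape[0]):
        budget = 1.0                                # probability mass left to assign
        for a in np.argsort(-q_values[s]):          # best estimated actions first
            p = min(c_max * beta[s, a], budget)     # growth limited by the cap
            pi[s, a] = p
            budget -= p
            if budget <= 1e-12:
                break
    return pi

# Hypothetical usage on a tiny 2-state, 3-action problem.
counts = np.array([[8, 1, 1], [2, 5, 3]], dtype=float)   # observed action counts per state
q_values = np.array([[0.1, 0.9, 0.2], [0.4, 0.3, 0.8]])  # estimated Q^beta(s, a)
beta = estimate_behavior_policy(counts)
pi = constrained_improvement(beta, q_values)
assert np.allclose(pi.sum(axis=1), 1.0)
```

Capping each action's probability at a multiple of its behavior probability is what keeps such an improvement step safe when the estimated values are noisy: the improved policy cannot concentrate its mass on an action the observed agents almost never took.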
