Anirudh Goyal | Philemon Brakel | William Fedus | Soumye Singhal | Timothy P. Lillicrap | Sergey Levine | Hugo Larochelle | Yoshua Bengio
[1] Dean Pomerleau,et al. ALVINN, an autonomous land vehicle in a neural network , 1988, NIPS.
[2] Shumeet Baluja,et al. A Method for Integrating Genetic Search Based Function Optimization and Competitive Learning , 1994 .
[3] Robert F. Stengel,et al. Optimal Control and Estimation , 1994 .
[4] Geoffrey E. Hinton,et al. The "wake-sleep" algorithm for unsupervised neural networks , 1995, Science.
[5] Geoffrey E. Hinton,et al. Using Expectation-Maximization for Reinforcement Learning , 1997, Neural Computation.
[6] John N. Tsitsiklis,et al. Actor-Critic Algorithms , 1999, NIPS.
[7] Andrew W. Moore,et al. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time , 1993, Machine Learning.
[8] Emanuel Todorov,et al. Linearly-solvable Markov decision problems , 2006, NIPS.
[9] Anind K. Dey,et al. Maximum Entropy Inverse Reinforcement Learning , 2008, AAAI.
[10] Richard S. Sutton,et al. Sample-based learning and search with permanent and transient memories , 2008, ICML.
[11] Marc Toussaint,et al. Robot trajectory optimization using approximate inference , 2009, ICML.
[12] Carl E. Rasmussen,et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search , 2011, ICML.
[13] Gerhard Neumann,et al. Variational Inference for Policy Search in changing situations , 2011, ICML.
[14] Geoffrey J. Gordon,et al. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning , 2010, AISTATS.
[15] J. Andrew Bagnell,et al. Agnostic System Identification for Model-Based Reinforcement Learning , 2012, ICML.
[16] Marc Toussaint,et al. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference , 2012, Robotics: Science and Systems.
[17] Vicenç Gómez,et al. Optimal control as a graphical model inference problem , 2009, Machine Learning.
[18] Yuval Tassa,et al. MuJoCo: A physics engine for model-based control , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.
[19] Jan Peters,et al. Reinforcement learning in robotics: A survey , 2013, Int. J. Robotics Res..
[20] Jan Peters,et al. A Survey on Policy Search for Robotics , 2013, Found. Trends Robotics.
[21] Sergey Levine,et al. Variational Policy Search via Trajectory Optimization , 2013, NIPS.
[22] Yoshua Bengio,et al. Generative Adversarial Nets , 2014, NIPS.
[23] Tom Schaul,et al. Universal Value Function Approximators , 2015, ICML.
[24] Yuval Tassa,et al. Learning Continuous Control Policies by Stochastic Value Gradients , 2015, NIPS.
[25] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.
[26] Shane Legg,et al. Human-level control through deep reinforcement learning , 2015, Nature.
[27] Honglak Lee,et al. Action-Conditional Video Prediction using Deep Networks in Atari Games , 2015, NIPS.
[28] Yuval Tassa,et al. Continuous control with deep reinforcement learning , 2015, ICLR.
[29] Xin Zhang,et al. End to End Learning for Self-Driving Cars , 2016, ArXiv.
[30] Alex Graves,et al. Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.
[31] Tom Schaul,et al. Unifying Count-Based Exploration and Intrinsic Motivation , 2016, NIPS.
[32] Nikolaus Hansen,et al. The CMA Evolution Strategy: A Tutorial , 2016, ArXiv.
[33] Demis Hassabis,et al. Mastering the game of Go with deep neural networks and tree search , 2016, Nature.
[34] Koray Kavukcuoglu,et al. PGQ: Combining policy gradient and Q-learning , 2016, ArXiv.
[35] Sergey Levine,et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation , 2015, ICLR.
[36] Sergey Levine,et al. Continuous Deep Q-Learning with Model-based Acceleration , 2016, ICML.
[37] Marc G. Bellemare,et al. Count-Based Exploration with Neural Density Models , 2017, ICML.
[38] Sergey Levine,et al. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic , 2016, ICLR.
[39] Razvan Pascanu,et al. Imagination-Augmented Agents for Deep Reinforcement Learning , 2017, NIPS.
[40] Nando de Freitas,et al. Sample Efficient Actor-Critic with Experience Replay , 2016, ICLR.
[41] Richard E. Turner,et al. Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning , 2017, NIPS.
[42] Sergey Levine,et al. Reinforcement Learning with Deep Energy-Based Policies , 2017, ICML.
[43] Dale Schuurmans,et al. Bridging the Gap Between Value and Policy Based Reinforcement Learning , 2017, NIPS.
[44] Surya Ganguli,et al. Variational Walkback: Learning a Transition Operator as a Stochastic Recurrent Net , 2017, NIPS.
[45] Daan Wierstra,et al. Recurrent Environment Simulators , 2017, ICLR.
[46] Philip Bachman,et al. Deep Reinforcement Learning that Matters , 2017, AAAI.
[47] Pieter Abbeel,et al. Automatic Goal Generation for Reinforcement Learning Agents , 2017, ICML.
[48] Sergey Levine,et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , 2018, ICML.
[49] Ilya Kostrikov,et al. Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play , 2017, ICLR.
[50] Sergey Levine,et al. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).
[51] Ashley D. Edwards,et al. Forward-Backward Reinforcement Learning , 2018, ArXiv.