Continual Auxiliary Task Learning

Learning auxiliary tasks, such as multiple predictions about the world, can provide many benefits to reinforcement learning systems. A variety of off-policy learning algorithms have been developed to learn such predictions, but there is as yet little work on how to adapt the behavior policy to gather useful data for those off-policy predictions. In this work, we investigate a reinforcement learning system designed to learn a collection of auxiliary tasks, with a behavior policy learning to take actions that improve those auxiliary predictions. We highlight the inherent non-stationarity in this continual auxiliary task learning problem, for both the prediction learners and the behavior learner. We develop an algorithm based on successor features that facilitates tracking under non-stationary rewards, and prove that separating the learning of successor features from the learning of rewards yields convergence-rate improvements. We conduct an in-depth study of the resulting multi-prediction learning system.
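
To illustrate the separation the abstract refers to, below is a minimal sketch (not the paper's exact algorithm) of a linear successor-feature predictor in which the successor features are learned with a slow TD update while the reward weights are tracked with a fast update, so the prediction can follow a drifting cumulant. The class name, step sizes, and the on-policy TD(0) simplification are all illustrative assumptions.

```python
import numpy as np

class SFPredictor:
    """Sketch: linear successor features plus separately tracked reward weights."""

    def __init__(self, num_features, gamma=0.99, alpha_sf=0.01, alpha_r=0.1):
        self.gamma = gamma
        self.alpha_sf = alpha_sf   # slow step size for the (quasi-stationary) successor features
        self.alpha_r = alpha_r     # fast step size to track the (non-stationary) reward weights
        self.Psi = np.zeros((num_features, num_features))  # successor-feature matrix
        self.w = np.zeros(num_features)                     # reward weight vector

    def update(self, x, x_next, cumulant):
        # TD(0) update for successor features: psi(s) ~ x + gamma * psi(s')
        sf_error = x + self.gamma * self.Psi @ x_next - self.Psi @ x
        self.Psi += self.alpha_sf * np.outer(sf_error, x)
        # Fast LMS update for the reward model: cumulant ~ w^T x
        r_error = cumulant - self.w @ x
        self.w += self.alpha_r * r_error * x
        return self.value(x)

    def value(self, x):
        # Prediction (general value function estimate): v(s) ~ w^T (Psi x)
        return self.w @ (self.Psi @ x)


# Hypothetical usage with one-hot features for a 4-state chain:
pred = SFPredictor(num_features=4)
x, x_next = np.eye(4)[0], np.eye(4)[1]
v = pred.update(x, x_next, cumulant=1.0)
```

The intent of the split is that when only the cumulant (reward signal) drifts, the slowly learned successor features remain valid and only the low-dimensional weight vector w must be re-tracked.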
