Continual Auxiliary Task Learning

Learning auxiliary tasks, such as multiple predictions about the world, can provide many benefits to reinforcement learning systems. A variety of off-policy learning algorithms have been developed to learn such predictions, but there is as yet little work on how to adapt the behavior policy to gather useful data for those off-policy predictions. In this work, we investigate a reinforcement learning system designed to learn a collection of auxiliary tasks, with a behavior policy learning to take actions that improve those auxiliary predictions. We highlight the inherent non-stationarity in this continual auxiliary task learning problem, for both the prediction learners and the behavior learner. We develop an algorithm based on successor features that facilitates tracking under non-stationary rewards, and prove that separating the learning of successor features from the learning of rewards yields convergence-rate improvements. We conduct an in-depth study of the resulting multi-prediction learning system.
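
To illustrate the separation the abstract refers to, below is a minimal sketch (not the paper's exact algorithm) of a linear successor-feature predictor in which the successor features are learned with a slow TD update while the reward weights are tracked with a fast update, so the prediction can follow a drifting cumulant. The class name, step sizes, and the on-policy TD(0) simplification are all illustrative assumptions.

```python
import numpy as np

class SFPredictor:
    """Sketch: linear successor features plus separately tracked reward weights."""

    def __init__(self, num_features, gamma=0.99, alpha_sf=0.01, alpha_r=0.1):
        self.gamma = gamma
        self.alpha_sf = alpha_sf   # slow step size for the (quasi-stationary) successor features
        self.alpha_r = alpha_r     # fast step size to track the (non-stationary) reward weights
        self.Psi = np.zeros((num_features, num_features))  # successor-feature matrix
        self.w = np.zeros(num_features)                     # reward weight vector

    def update(self, x, x_next, cumulant):
        # TD(0) update for successor features: psi(s) ~ x + gamma * psi(s')
        sf_error = x + self.gamma * self.Psi @ x_next - self.Psi @ x
        self.Psi += self.alpha_sf * np.outer(sf_error, x)
        # Fast LMS update for the reward model: cumulant ~ w^T x
        r_error = cumulant - self.w @ x
        self.w += self.alpha_r * r_error * x
        return self.value(x)

    def value(self, x):
        # Prediction (general value function estimate): v(s) ~ w^T (Psi x)
        return self.w @ (self.Psi @ x)


# Hypothetical usage with one-hot features for a 4-state chain:
pred = SFPredictor(num_features=4)
x, x_next = np.eye(4)[0], np.eye(4)[1]
v = pred.update(x, x_next, cumulant=1.0)
```

The intent of the split is that when only the cumulant (reward signal) drifts, the slowly learned successor features remain valid and only the low-dimensional weight vector w must be re-tracked.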
