Meta-Q-Learning

This paper introduces Meta-Q-Learning (MQL), a new off-policy algorithm for meta-Reinforcement Learning (meta-RL). MQL builds upon three simple ideas. First, we show that Q-learning is competitive with state-of-the-art meta-RL algorithms if given access to a context variable that is a representation of the past trajectory. Second, a multi-task objective to maximize the average reward across the training tasks is an effective method to meta-train RL policies. Third, past data from the meta-training replay buffer can be recycled to adapt the policy on a new task using off-policy updates. MQL draws upon ideas in propensity estimation to do so and thereby amplifies the amount of available data for adaptation. Experiments on standard continuous-control benchmarks suggest that MQL compares favorably with the state of the art in meta-RL.
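To make the three ideas in the abstract concrete, here is a minimal sketch in PyTorch, not the authors' code: a GRU context encoder that summarizes the past trajectory (idea 1), a TD loss averaged over a batch of training tasks (idea 2), and an adaptation loss that recycles meta-training transitions re-weighted by propensity scores (idea 3). All class names, the policy interface, batch layout, and hyper-parameters are illustrative assumptions; the propensity weights are assumed to be estimated elsewhere, e.g. by a classifier between new-task and meta-training data.

```python
# Illustrative sketch of the three MQL ideas (assumed interfaces, not the paper's implementation).
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Idea 1: summarize the past trajectory into a context variable with a GRU."""
    def __init__(self, transition_dim, context_dim):
        super().__init__()
        self.gru = nn.GRU(transition_dim, context_dim, batch_first=True)

    def forward(self, past_transitions):
        # past_transitions: (batch, T, transition_dim), e.g. concatenated (state, action, reward)
        _, h = self.gru(past_transitions)
        return h.squeeze(0)  # (batch, context_dim)

class ContextQ(nn.Module):
    """A Q-function conditioned on (state, action, context)."""
    def __init__(self, state_dim, action_dim, context_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))

def td_error(q, q_target, policy, encoder, batch, gamma=0.99):
    """Per-sample squared TD error for one replay-buffer batch."""
    z = encoder(batch["past"])
    with torch.no_grad():
        a_next = policy(batch["next_state"], z)          # assumed policy(state, context) interface
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * \
            q_target(batch["next_state"], a_next, z)
    return (q(batch["state"], batch["action"], z) - target) ** 2

def multi_task_loss(q, q_target, policy, encoder, task_batches, gamma=0.99):
    """Idea 2: meta-train by averaging the TD loss across training tasks."""
    losses = [td_error(q, q_target, policy, encoder, b, gamma).mean() for b in task_batches]
    return torch.stack(losses).mean()

def adaptation_loss(q, q_target, policy, encoder, new_batch, meta_batch,
                    propensity_weights, gamma=0.99):
    """Idea 3: adapt to a new task with off-policy updates that recycle
    meta-training data, importance-weighted by estimated propensity scores."""
    new_term = td_error(q, q_target, policy, encoder, new_batch, gamma).mean()
    meta_term = (propensity_weights * td_error(q, q_target, policy, encoder,
                                               meta_batch, gamma)).mean()
    return new_term + meta_term
```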
