Meta-Q-Learning

This paper introduces Meta-Q-Learning (MQL), a new off-policy algorithm for meta-Reinforcement Learning (meta-RL). MQL builds upon three simple ideas. First, we show that Q-learning is competitive with state-of-the-art meta-RL algorithms if given access to a context variable that is a representation of the past trajectory. Second, a multi-task objective to maximize the average reward across the training tasks is an effective method to meta-train RL policies. Third, past data from the meta-training replay buffer can be recycled to adapt the policy on a new task using off-policy updates. MQL draws upon ideas in propensity estimation to do so and thereby amplifies the amount of available data for adaptation. Experiments on standard continuous-control benchmarks suggest that MQL compares favorably with the state of the art in meta-RL.
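To make the three ideas in the abstract concrete, here is a minimal sketch in PyTorch, not the authors' code: a GRU context encoder that summarizes the past trajectory (idea 1), a TD loss averaged over a batch of training tasks (idea 2), and an adaptation loss that recycles meta-training transitions re-weighted by propensity scores (idea 3). All class names, the policy interface, batch layout, and hyper-parameters are illustrative assumptions; the propensity weights are assumed to be estimated elsewhere, e.g. by a classifier between new-task and meta-training data.

```python
# Illustrative sketch of the three MQL ideas (assumed interfaces, not the paper's implementation).
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Idea 1: summarize the past trajectory into a context variable with a GRU."""
    def __init__(self, transition_dim, context_dim):
        super().__init__()
        self.gru = nn.GRU(transition_dim, context_dim, batch_first=True)

    def forward(self, past_transitions):
        # past_transitions: (batch, T, transition_dim), e.g. concatenated (state, action, reward)
        _, h = self.gru(past_transitions)
        return h.squeeze(0)  # (batch, context_dim)

class ContextQ(nn.Module):
    """A Q-function conditioned on (state, action, context)."""
    def __init__(self, state_dim, action_dim, context_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))

def td_error(q, q_target, policy, encoder, batch, gamma=0.99):
    """Per-sample squared TD error for one replay-buffer batch."""
    z = encoder(batch["past"])
    with torch.no_grad():
        a_next = policy(batch["next_state"], z)          # assumed policy(state, context) interface
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * \
            q_target(batch["next_state"], a_next, z)
    return (q(batch["state"], batch["action"], z) - target) ** 2

def multi_task_loss(q, q_target, policy, encoder, task_batches, gamma=0.99):
    """Idea 2: meta-train by averaging the TD loss across training tasks."""
    losses = [td_error(q, q_target, policy, encoder, b, gamma).mean() for b in task_batches]
    return torch.stack(losses).mean()

def adaptation_loss(q, q_target, policy, encoder, new_batch, meta_batch,
                    propensity_weights, gamma=0.99):
    """Idea 3: adapt to a new task with off-policy updates that recycle
    meta-training data, importance-weighted by estimated propensity scores."""
    new_term = td_error(q, q_target, policy, encoder, new_batch, gamma).mean()
    meta_term = (propensity_weights * td_error(q, q_target, policy, encoder,
                                               meta_batch, gamma)).mean()
    return new_term + meta_term
```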
