Alchemy: A structured task distribution for meta-reinforcement learning

There has been rapidly growing interest in meta-learning as a method for increasing the flexibility and sample efficiency of reinforcement learning. One problem in this area of research, however, has been a scarcity of adequate benchmark tasks. In general, the structure underlying past benchmarks has either been too simple to be inherently interesting, or too ill-defined to support principled analysis. In the present work, we introduce a new benchmark for meta-RL research, which combines structural richness with structural transparency. Alchemy is a 3D video game, implemented in Unity, which involves a latent causal structure that is resampled procedurally from episode to episode, affording structure learning, online inference, hypothesis testing, and action sequencing based on abstract domain knowledge. We evaluate a pair of powerful RL agents on Alchemy and present an in-depth analysis of one of these agents. The results clearly indicate a frank and specific failure of meta-learning, providing validation for Alchemy as a challenging benchmark for meta-RL. Concurrently with this report, we are releasing Alchemy as a public resource, together with a suite of analysis tools and sample agent trajectories.
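The released environment is exposed through DeepMind's standard dm_env interface. As a rough sketch of how an agent interacts with it, the loop below runs a single episode with uniformly random actions; the loader name `load_alchemy` is a placeholder rather than the actual entry point of the public package, and everything else assumes only the generic dm_env API.

```python
# Minimal random-agent episode loop for an Alchemy-style environment.
# Assumes only the generic dm_env interface; `load_alchemy` is a
# placeholder, not the real entry point of the released package.
import numpy as np
from dm_env import specs


def random_action(spec, rng):
    """Samples a uniformly random action compatible with the action spec."""
    if isinstance(spec, specs.DiscreteArray):
        return rng.integers(spec.num_values)
    # Otherwise assume a bounded continuous action spec.
    return rng.uniform(spec.minimum, spec.maximum, size=spec.shape)


def run_episode(env, rng):
    """Runs one episode with random actions and returns the episode return."""
    timestep = env.reset()
    episode_return = 0.0
    while not timestep.last():
        timestep = env.step(random_action(env.action_spec(), rng))
        episode_return += timestep.reward or 0.0
    return episode_return


# env = load_alchemy(seed=0)  # placeholder loader; see the public release
# print(run_episode(env, np.random.default_rng(0)))
```

Because the latent causal structure is resampled at the start of every episode, any within-episode improvement over a random baseline like this must come from online inference about the current episode's structure rather than from memorization across episodes.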
