Alchemy: A structured task distribution for meta-reinforcement learning

There has been rapidly growing interest in meta-learning as a method for increasing the flexibility and sample efficiency of reinforcement learning. One problem in this area of research, however, has been a scarcity of adequate benchmark tasks. In general, the structure underlying past benchmarks has either been too simple to be inherently interesting, or too ill-defined to support principled analysis. In the present work, we introduce a new benchmark for meta-RL research, which combines structural richness with structural transparency. Alchemy is a 3D video game, implemented in Unity, which involves a latent causal structure that is resampled procedurally from episode to episode, affording structure learning, online inference, hypothesis testing, and action sequencing based on abstract domain knowledge. We evaluate a pair of powerful RL agents on Alchemy and present an in-depth analysis of one of these agents. The results clearly indicate a frank and specific failure of meta-learning, providing validation for Alchemy as a challenging benchmark for meta-RL. Concurrently with this report, we are releasing Alchemy as a public resource, together with a suite of analysis tools and sample agent trajectories.
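The released environment is exposed through DeepMind's standard dm_env interface. As a rough sketch of how an agent interacts with it, the loop below runs a single episode with uniformly random actions; the loader name `load_alchemy` is a placeholder rather than the actual entry point of the public package, and everything else assumes only the generic dm_env API.

```python
# Minimal random-agent episode loop for an Alchemy-style environment.
# Assumes only the generic dm_env interface; `load_alchemy` is a
# placeholder, not the real entry point of the released package.
import numpy as np
from dm_env import specs


def random_action(spec, rng):
    """Samples a uniformly random action compatible with the action spec."""
    if isinstance(spec, specs.DiscreteArray):
        return rng.integers(spec.num_values)
    # Otherwise assume a bounded continuous action spec.
    return rng.uniform(spec.minimum, spec.maximum, size=spec.shape)


def run_episode(env, rng):
    """Runs one episode with random actions and returns the episode return."""
    timestep = env.reset()
    episode_return = 0.0
    while not timestep.last():
        timestep = env.step(random_action(env.action_spec(), rng))
        episode_return += timestep.reward or 0.0
    return episode_return


# env = load_alchemy(seed=0)  # placeholder loader; see the public release
# print(run_episode(env, np.random.default_rng(0)))
```

Because the latent causal structure is resampled at the start of every episode, any within-episode improvement over a random baseline like this must come from online inference about the current episode's structure rather than from memorization across episodes.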
