Alchemy: A benchmark and analysis toolkit for meta-reinforcement learning agents

There has been rapidly growing interest in meta-learning as a method for increasing the flexibility and sample efficiency of reinforcement learning. One problem in this area of research, however, has been a scarcity of adequate benchmark tasks. In general, the structure underlying past benchmarks has either been too simple to be inherently interesting, or too ill-defined to support principled analysis. In the present work, we introduce a new benchmark for meta-RL research, emphasizing transparency and potential for in-depth analysis as well as structural richness. Alchemy is a 3D video game, implemented in Unity, which involves a latent causal structure that is resampled procedurally from episode to episode, affording structure learning, online inference, hypothesis testing, and action sequencing based on abstract domain knowledge. We evaluate a pair of powerful RL agents on Alchemy and present an in-depth analysis of one of these agents. Results clearly indicate a frank and specific failure of meta-learning, providing validation for Alchemy as a challenging benchmark for meta-RL. Concurrent with this report, we are releasing Alchemy as a public resource, together with a suite of analysis tools and sample agent trajectories.
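Since the abstract describes a publicly released environment, a minimal sketch of what an agent-environment loop over the release might look like is shown below, assuming a standard dm_env-style interface. The module name `dm_alchemy`, the `EnvironmentSettings`/`load_from_docker` loader, the level string, and the single bounded-array action spec are illustrative assumptions, not a confirmed API.

```python
# Minimal sketch of interacting with the released Alchemy environment.
# Assumes a dm_env-style interface; the package name, loader, level name,
# and action-spec shape below are assumptions for illustration only.
import numpy as np

import dm_alchemy  # assumed package name for the public release

LEVEL_NAME = 'alchemy/perceptual_mapping_randomized_with_random_bottleneck'  # illustrative

settings = dm_alchemy.EnvironmentSettings(seed=123, level_name=LEVEL_NAME)
env = dm_alchemy.load_from_docker(settings)

spec = env.action_spec()  # assumed to be a single bounded array spec
timestep = env.reset()
episode_return = 0.0

# Random-action rollout: each episode resamples a fresh latent causal
# structure, so a meta-learning agent must infer it online from experience.
while not timestep.last():
    action = np.random.uniform(
        spec.minimum, spec.maximum, spec.shape).astype(spec.dtype)
    timestep = env.step(action)
    episode_return += timestep.reward or 0.0

print('episode return:', episode_return)
```

The same loop structure applies to any dm_env-compatible agent; only the action-selection line would change for a learned policy.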
