MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge

Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite and knowledge bases (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.
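
The abstract does not spell out how a video-language model becomes a reward function. A minimal sketch of one plausible reading, assuming the model embeds a short video clip and the task prompt into a shared vector space and that their cosine similarity is used as the reward, is given below. The function names and the toy embeddings are hypothetical and are not taken from the MineDojo code base.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def language_reward(video_embedding: Sequence[float],
                    text_embedding: Sequence[float]) -> float:
    """Hypothetical learned reward: how well recent frames match the task prompt.

    Assumption: both embeddings come from the same pre-trained
    video-language model, so they live in a shared space.
    """
    return cosine_similarity(video_embedding, text_embedding)


# Toy embeddings standing in for real model outputs.
prompt = [1.0, 0.0]          # e.g. embedding of a free-form task description
matching_clip = [2.0, 0.0]   # clip that matches the prompt
unrelated_clip = [0.0, 3.0]  # clip unrelated to the prompt
reward_match = language_reward(matching_clip, prompt)
reward_unrelated = language_reward(unrelated_clip, prompt)
```

In this reading, a reinforcement learning agent would receive `language_reward` at each step in place of a hand-designed shaping reward, so the same policy training loop could serve any task that can be described in text.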
