MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge

Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite, knowledge base, algorithm implementation, and pre-trained models (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.
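
To make the core idea concrete, here is a minimal sketch of how a pre-trained video-language model can serve as a learned reward function. The `VideoLanguageModel` interface with `encode_text`/`encode_video` methods, the sliding-window length, and the cosine-similarity reward are illustrative assumptions in the spirit of the abstract, not the paper's exact recipe:

```python
# Sketch: dense reward from a CLIP-style video-language model.
# Assumes a model exposing encode_text(str) -> (D,) and
# encode_video((T, C, H, W)) -> (D,) embeddings; both are hypothetical
# stand-ins for whatever pre-trained model one actually uses.
from collections import deque

import torch
import torch.nn.functional as F


class VideoLanguageReward:
    def __init__(self, model, task_prompt: str, window: int = 16):
        self.model = model                  # pre-trained video-language model
        self.frames = deque(maxlen=window)  # sliding window of recent frames
        with torch.no_grad():
            self.text_emb = F.normalize(model.encode_text(task_prompt), dim=-1)

    def __call__(self, frame: torch.Tensor) -> float:
        """Reward = cosine similarity between the agent's recent video
        clip and the free-form language goal."""
        self.frames.append(frame)
        clip = torch.stack(list(self.frames))  # (T, C, H, W)
        with torch.no_grad():
            video_emb = F.normalize(self.model.encode_video(clip), dim=-1)
        return (video_emb * self.text_emb).sum().item()


# Usage inside an RL loop (e.g., PPO): substitute this learned signal
# for the environment's sparse reward.
# reward_fn = VideoLanguageReward(video_language_model, "shear a sheep")
# r_t = reward_fn(obs_frame)
```

The design point this illustrates: because the reward is just goal-conditioned similarity in a shared video-text embedding space, the same agent can be trained on any task specified in free-form language, with no per-task shaping reward.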
