Learning to Solve Voxel Building Embodied Tasks from Pixels and Natural Language Instructions

Using pre-trained language models to generate action plans for embodied agents is a promising research strategy. However, executing instructions in real or simulated environments requires verifying that each action is feasible and that it is relevant to completing the goal. We propose a new method that combines a language model and reinforcement learning for the task of building objects in a Minecraft-like environment according to natural language instructions. Our method first generates a set of consistently achievable sub-goals from the instructions and then completes the associated sub-tasks with a pre-trained RL policy. The proposed method served as the RL baseline at the IGLU 2022 competition.
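
The two-stage structure described above (sub-goal generation followed by sub-task execution) can be illustrated with a minimal sketch. This is not the implementation used in the paper: the block encoding, the feasibility rule in `is_achievable`, and the `planner` and `policy` callables are all hypothetical stand-ins, introduced only to show how a feasibility check can sit between a language model and a pre-trained policy.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical encoding: a sub-goal is one target block at (x, y, z), y = height.
Block = Tuple[int, int, int]


def is_achievable(block: Block, placed: Set[Block]) -> bool:
    """Assumed feasibility rule (not from the paper): a new block must lie
    on the ground (y == 0) or share a face with an already placed block."""
    if block in placed:
        return False
    x, y, z = block
    if y == 0:
        return True
    neighbours = [
        (x + 1, y, z), (x - 1, y, z),
        (x, y + 1, z), (x, y - 1, z),
        (x, y, z + 1), (x, y, z - 1),
    ]
    return any(n in placed for n in neighbours)


def order_subgoals(targets: List[Block]) -> List[Block]:
    """Greedily order target blocks so that each one is achievable at the
    moment it is attempted; unreachable targets are dropped."""
    placed: Set[Block] = set()
    remaining = list(targets)
    ordered: List[Block] = []
    while remaining:
        nxt = next((b for b in remaining if is_achievable(b, placed)), None)
        if nxt is None:
            break  # nothing left satisfies the assumed rule
        ordered.append(nxt)
        placed.add(nxt)
        remaining.remove(nxt)
    return ordered


def build(
    instruction: str,
    planner: Callable[[str], List[Block]],       # stand-in for the language model
    policy: Callable[[Block, Set[Block]], bool],  # stand-in for the RL policy
) -> Set[Block]:
    """Plan sub-goals from the instruction, then execute them one by one."""
    placed: Set[Block] = set()
    for goal in order_subgoals(planner(instruction)):
        if policy(goal, placed):  # True means the sub-task succeeded
            placed.add(goal)
    return placed


# Toy stand-ins so the sketch runs end to end.
def toy_planner(instruction: str) -> List[Block]:
    return [(0, 1, 0), (0, 0, 0), (5, 3, 5)]  # last block floats in mid-air


def toy_policy(goal: Block, placed: Set[Block]) -> bool:
    return True  # pretend every sub-task is completed


result = build("stack two blocks", toy_planner, toy_policy)
```

In this toy run the planner proposes the upper block first, so the ordering step moves the ground block ahead of it and discards the floating block, which no sequence of placements can reach under the assumed rule.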
