ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Given a simple request (e.g., Put a washed apple in the kitchen fridge), humans can reason in purely abstract terms by imagining action sequences and scoring their likelihood of success, prototypicality, and efficiency, all without moving a muscle. Once we see the kitchen in question, we can update our abstract plans to fit the scene. Embodied agents require the same abilities, but existing work does not yet provide the infrastructure necessary for both reasoning abstractly and executing concretely. We address this limitation by introducing ALFWorld, a simulator that enables agents to learn abstract, text-based policies in TextWorld (Cote et al., 2018) and then execute goals from the ALFRED benchmark (Shridhar et al., 2020) in a rich visual environment. ALFWorld enables the creation of a new BUTLER agent whose abstract knowledge, learned in TextWorld, corresponds directly to concrete, visually grounded actions. In turn, as we demonstrate empirically, this fosters better agent generalization than training only in the visually grounded environment. BUTLER's simple, modular design factors the problem to allow researchers to focus on models for improving every piece of the pipeline (language understanding, planning, navigation, visual scene understanding, and so forth).

[1]  Li Fei-Fei,et al.  DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Hannes Schulz,et al.  Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation , 2017, ArXiv.

[3]  Bowen Zhou,et al.  Pointing the Unknown Words , 2016, ACL.

[4]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[5]  Jürgen Schmidhuber,et al.  Recurrent World Models Facilitate Policy Evolution , 2018, NeurIPS.

[6]  Chelsea Finn,et al.  Language as an Abstraction for Hierarchical Deep Reinforcement Learning , 2019, NeurIPS.

[7]  Ali Farhadi,et al.  Visual Semantic Planning Using Deep Successor Representations , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[10]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[11]  Ali Farhadi,et al.  IQA: Visual Question Answering in Interactive Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Malte Helmert,et al.  The Fast Downward Planning System , 2006, J. Artif. Intell. Res..

[13]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[14]  Peter Stone,et al.  Deep Recurrent Q-Learning for Partially Observable MDPs , 2015, AAAI Fall Symposia.

[15]  Cordelia Schmid,et al.  VideoBERT: A Joint Model for Video and Language Representation Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Benjamin Kuipers,et al.  Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions , 2006, AAAI.

[17]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[18]  Yuval Tassa,et al.  MuJoCo: A physics engine for model-based control , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[19]  Craig A. Knoblock,et al.  PDDL-the planning domain definition language , 1998 .

[20]  Romain Laroche,et al.  Counting to Explore and Generalize in Text-based Games , 2018, ArXiv.

[21]  Yuan-Fang Wang,et al.  Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Edward Grefenstette,et al.  The NetHack Learning Environment , 2020, NeurIPS.

[23]  Stefan Lee,et al.  Embodied Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[24]  Licheng Yu,et al.  MAttNet: Modular Attention Network for Referring Expression Comprehension , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Jianqiang Huang,et al.  Unbiased Scene Graph Generation From Biased Training , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Matthew J. Hausknecht,et al.  Interactive Fiction Games: A Colossal Adventure , 2020, AAAI.

[27]  Ali Farhadi,et al.  AI2-THOR: An Interactive 3D Environment for Visual AI , 2017, ArXiv.

[28]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[29]  Qi Wu,et al.  Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Jiajun Wu,et al.  Learning to See Physics via Visual De-animation , 2017, NIPS.

[31]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[32]  Lior Wolf,et al.  Using the Output Embedding to Improve Language Models , 2016, EACL.

[33]  Romain Laroche,et al.  Learning Dynamic Belief Graphs to Generalize on Text-Based Games , 2020, NeurIPS.

[34]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[35]  Yuandong Tian,et al.  Hierarchical Decision Making by Generating and Following Natural Language Instructions , 2019, NeurIPS.

[36]  Trevor Darrell,et al.  Modeling Relationships in Referential Expressions with Compositional Modular Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Roozbeh Mottaghi,et al.  ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Alexander M. Rush,et al.  Bottom-Up Abstractive Summarization , 2018, EMNLP.

[39]  Matthew J. Hausknecht,et al.  TextWorld: A Learning Environment for Text-based Games , 2018, CGW@IJCAI.

[40]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[41]  Geoffrey J. Gordon,et al.  A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning , 2010, AISTATS.

[42]  Thomas Wolf,et al.  DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , 2019, ArXiv.

[43]  Matthew J. Hausknecht,et al.  Graph Constrained Reinforcement Learning for Natural Language Action Spaces , 2020, ICLR.

[44]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).