Hierarchical Control of Situated Agents through Natural Language

When humans plan how to perform a task, they do so hierarchically: they split a high-level task into smaller sub-tasks. In the literature on natural language (NL) command of situated agents, however, most work has treated the procedures to be executed as flat sequences of simple actions, or has used only shallow hierarchies of procedures. In this paper, we propose a formalism of procedures as programs, a powerful yet intuitive method of representing hierarchical procedural knowledge for agent command and control. We further propose a modeling paradigm of hierarchical modular networks, consisting of a planner and reactors that convert NL intents into predictions of executable programs and probe the environment for the information needed to complete program execution. We instantiate this framework on the IQA and ALFRED datasets for NL instruction following. Our model outperforms reactive baselines by a large margin on both datasets. We also demonstrate that our framework is more data-efficient and allows for fast iterative development.
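The planner/reactor decomposition described above can be sketched in a few lines. The following is a minimal toy illustration, not the paper's actual implementation: the class names (`Planner`, `Reactor`), the lookup-table "planner", and the toy environment are all hypothetical stand-ins. A real planner would be a learned model that maps an NL intent to a program over sub-task calls, and each reactor would ground its sub-task by probing the (partially observable) environment.

```python
# Hypothetical sketch of a planner that predicts a program of reactor calls,
# and reactors that probe the environment while executing their sub-tasks.
# All names here are illustrative, not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class Call:
    """One step of a predicted program: which reactor to invoke, on what."""
    reactor: str
    arg: str

class Planner:
    """Maps an NL intent to a hierarchical program (a sequence of reactor
    calls). A real planner would be a learned model; here, a lookup table."""
    def __init__(self):
        self.library = {
            "heat the potato": [
                Call("goto", "potato"), Call("pickup", "potato"),
                Call("goto", "microwave"), Call("heat", "potato"),
            ],
        }

    def predict(self, intent: str) -> list:
        return self.library[intent]

class Reactor:
    """Executes one sub-task, probing the environment for the information
    (e.g., an object's location) needed to complete it."""
    def __init__(self, name: str):
        self.name = name

    def run(self, env: "Env", arg: str) -> None:
        location = env.probe(arg)  # probe the environment for grounding
        env.log.append(f"{self.name}({arg}@{location})")

@dataclass
class Env:
    """Toy environment: a map from objects to locations, plus an action log."""
    objects: dict
    log: list = field(default_factory=list)

    def probe(self, obj: str) -> str:
        return self.objects.get(obj, "unknown")

env = Env(objects={"potato": "counter", "microwave": "wall"})
planner = Planner()
reactors = {name: Reactor(name) for name in ("goto", "pickup", "heat")}

# The planner converts the NL intent into a program; each step is then
# delegated to the corresponding reactor for grounded execution.
for call in planner.predict("heat the potato"):
    reactors[call.reactor].run(env, call.arg)

print(env.log)
```

The key property this sketch illustrates is the separation of concerns: the planner reasons only over the symbolic program, while the reactors handle environment interaction, which is what makes the hierarchy modular and individually testable.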
