Learning to Act with Affordance-Aware Multimodal Neural SLAM

Recent years have witnessed an emerging paradigm shift toward embodied artificial intelligence, in which an agent must learn to solve challenging tasks by interacting with its environment. Solving embodied multimodal tasks poses several challenges, including long-horizon planning, vision-and-language grounding, and efficient exploration. We focus on a critical bottleneck: the performance of planning and navigation. To tackle this challenge, we propose a Neural SLAM approach that, for the first time, exploits multiple modalities for exploration, predicts an affordance-aware semantic map, and plans over it simultaneously. This significantly improves exploration efficiency, enables robust long-horizon planning, and supports effective vision-and-language grounding. With the proposed Affordance-aware Multimodal Neural SLAM (AMSLAM) approach, we obtain more than 40% improvement over prior published work on the ALFRED benchmark and set a new state-of-the-art generalization performance at a success rate of 23.48% on the test unseen scenes.
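
The abstract describes a pipeline in which multimodal observations are turned into an affordance-aware semantic map that the agent then plans over. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: the module names, channel counts, map size, and the greedy goal-selection step are all assumptions introduced here for clarity.

```python
# Hypothetical sketch: predict an affordance-aware map from an RGB-D
# observation, then pick the next navigation goal from that map.
import torch
import torch.nn as nn


class AffordanceMapPredictor(nn.Module):
    """Predicts per-cell semantic class and affordance scores from RGB-D input
    (illustrative architecture; not the one used in the paper)."""

    def __init__(self, in_channels=4, num_classes=20, num_affordances=7, map_size=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(map_size),  # project features onto a fixed map grid
        )
        self.semantic_head = nn.Conv2d(64, num_classes, kernel_size=1)
        self.affordance_head = nn.Conv2d(64, num_affordances, kernel_size=1)

    def forward(self, rgbd):  # rgbd: (B, 4, H, W)
        feats = self.encoder(rgbd)
        return self.semantic_head(feats), self.affordance_head(feats)


def select_goal(affordance_map, target_affordance):
    """Greedy planner stub: choose the map cell with the highest score for the
    affordance required by the current subgoal."""
    scores = affordance_map[target_affordance]  # (map_size, map_size)
    flat_idx = torch.argmax(scores).item()
    row, col = divmod(flat_idx, scores.shape[1])
    return row, col


if __name__ == "__main__":
    predictor = AffordanceMapPredictor()
    rgbd = torch.randn(1, 4, 128, 128)  # fake RGB-D observation
    semantics, affordances = predictor(rgbd)
    goal = select_goal(affordances[0], target_affordance=2)
    print("next navigation goal (map cell):", goal)
```

In a full system, the predicted map would be aggregated over time into a persistent top-down representation and the goal selection replaced by a learned or search-based planner; the snippet only shows where affordance prediction sits relative to planning.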
