Early Fusion for Goal Directed Robotic Vision

Building perceptual systems for robots that perform well under tight computational budgets requires novel architectures that rethink the traditional computer vision pipeline. Modern vision architectures require the agent to build a summary representation of the entire scene, even when most of the input is irrelevant to the agent's current goal. In this work, we flip this paradigm by introducing EarlyFusion vision models that condition on a goal to build custom representations for downstream tasks. We show that these goal-specific representations are learned more quickly, are substantially more parameter-efficient, and are more robust than existing attention mechanisms in our domain. We demonstrate the effectiveness of these methods on a simulated item-retrieval task, training the models fully end-to-end via imitation learning.
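A minimal sketch of the early-fusion idea, under assumed layer sizes and names (not the paper's exact architecture): the goal embedding is tiled over the spatial grid and concatenated with the image before the convolutional trunk, so every downstream feature is computed with the goal in view, rather than attending over a fully built scene representation afterwards.

```python
# Hypothetical early-fusion encoder: goal conditioning injected before convolution.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, num_goals, goal_dim=16, img_channels=3):
        super().__init__()
        self.goal_embed = nn.Embedding(num_goals, goal_dim)
        # The trunk sees image channels + goal channels from the first layer on.
        self.trunk = nn.Sequential(
            nn.Conv2d(img_channels + goal_dim, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, image, goal_id):
        # image: (B, C, H, W); goal_id: (B,) integer id of the target item.
        b, _, h, w = image.shape
        goal = self.goal_embed(goal_id)                         # (B, goal_dim)
        goal_map = goal[:, :, None, None].expand(b, -1, h, w)   # tile spatially
        fused = torch.cat([image, goal_map], dim=1)             # fuse before convolution
        return self.trunk(fused)                                # goal-specific features

# Usage: features = EarlyFusionEncoder(num_goals=10)(images, goal_ids)
```

The design choice this illustrates is that fusing the goal early lets the trunk discard irrelevant regions from the first layer, which is where the parameter-efficiency and robustness gains over late attention mechanisms would come from.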
