Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

Many robotic applications require the agent to perform long-horizon tasks in partially observable environments. In such applications, decision making at any step can depend on observations received far in the past. Hence, being able to properly memorize and utilize the long-term history is crucial. In this work, we propose a novel memory-based policy, named Scene Memory Transformer (SMT). The proposed policy embeds and adds each observation to a memory and uses the attention mechanism to exploit spatio-temporal dependencies. This model is generic and can be efficiently trained with reinforcement learning over long episodes. On a range of visual navigation tasks, SMT demonstrates superior performance to existing reactive and memory-based policies by a margin.

[1]  Sergio Gomez Colmenarejo,et al.  Hybrid computing using a neural network with dynamic external memory , 2016, Nature.

[2]  Wolfram Burgard,et al.  Neural SLAM , 2017, ArXiv.

[3]  Yuandong Tian,et al.  Building Generalizable Agents with a Realistic and Rich 3D Environment , 2018, ICLR.

[4]  Jana Kosecka,et al.  Visual Representations for Semantic Target Driven Navigation , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[5]  Jitendra Malik,et al.  Gibson Env: Real-World Perception for Embodied Agents , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Yee Whye Teh,et al.  Set Transformer , 2018, ArXiv.

[7]  Jason Weston,et al.  End-To-End Memory Networks , 2015, NIPS.

[8]  Vladlen Koltun,et al.  Semi-parametric Topological Memory for Navigation , 2018, ICLR.

[9]  Cyrill Stachniss,et al.  Simultaneous Localization and Mapping , 2016, Springer Handbook of Robotics, 2nd Ed..

[10]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[11]  Qi Wu,et al.  Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Rahul Sukthankar,et al.  Cognitive Mapping and Planning for Visual Navigation , 2017, International Journal of Computer Vision.

[13]  Quoc V. Le,et al.  Learning Longer-term Dependencies in RNNs with Auxiliary Losses , 2018, ICML.

[14]  Leslie Pack Kaelbling,et al.  Planning and Acting in Partially Observable Stochastic Domains , 1998, Artif. Intell..

[15]  Vijay Kumar,et al.  Memory Augmented Control Networks , 2017, ICLR.

[16]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17]  Jitendra Malik,et al.  On Evaluation of Embodied Navigation Agents , 2018, ArXiv.

[18]  Avinash C. Kak,et al.  Vision for Mobile Robot Navigation: A Survey , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[19]  Ali Farhadi,et al.  Visual Semantic Planning Using Deep Successor Representations , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[21]  Thomas A. Funkhouser,et al.  Semantic Scene Completion from a Single Depth Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[23]  Wolfram Burgard,et al.  Neural SLAM: Learning to Explore with External Memory , 2017, 1706.09520.

[24]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[25]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  David Wooden,et al.  A guide to vision-based map building , 2006, IEEE Robotics & Automation Magazine.

[27]  Peter I. Corke,et al.  Vision-only autonomous navigation using topometric maps , 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[28]  Razvan Pascanu,et al.  Learning to Navigate in Complex Environments , 2016, ICLR.

[29]  Koray Kavukcuoglu,et al.  Neural scene representation and rendering , 2018, Science.

[30]  Ali Farhadi,et al.  Target-driven visual navigation in indoor scenes using deep reinforcement learning , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[31]  James J. Little,et al.  Autonomous vision-based exploration and mapping using hybrid maps and Rao-Blackwellised particle filters , 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[32]  Ruslan Salakhutdinov,et al.  Neural Map: Structured Memory for Deep Reinforcement Learning , 2017, ICLR.

[33]  Anil A. Bharath,et al.  Deep Reinforcement Learning: A Brief Survey , 2017, IEEE Signal Processing Magazine.

[34]  Ali Farhadi,et al.  IQA: Visual Question Answering in Interactive Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Sergey Levine,et al.  Learning deep neural network policies with continuous memory states , 2015, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[37]  Masahiro Tomono,et al.  3-D Object Map Building Using Dense Object Models with SIFT-based Recognition Features , 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[38]  Honglak Lee,et al.  Control of Memory, Active Perception, and Action in Minecraft , 2016, ICML.

[39]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[40]  Jitendra Malik,et al.  Unifying Map and Landmark Based Representations for Visual Navigation , 2017, ArXiv.

[41]  Jürgen Schmidhuber,et al.  Recurrent policy gradients , 2010, Log. J. IGPL.

[42]  Wojciech Jaskowski,et al.  ViZDoom: A Doom-based AI research platform for visual reinforcement learning , 2016, 2016 IEEE Conference on Computational Intelligence and Games (CIG).

[43]  Sumetee kesorn Visual Navigation for Mobile Robots: a Survey , 2012 .

[44]  Ruslan Salakhutdinov,et al.  Active Neural Localization , 2018, ICLR.

[45]  Oliver Brock,et al.  Differentiable Particle Filters: End-to-End Learning with Algorithmic Priors , 2018, Robotics: Science and Systems.

[46]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[47]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[48]  Leonidas J. Guibas,et al.  PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[49]  Bram Bakker,et al.  Reinforcement Learning with Long Short-Term Memory , 2001, NIPS.

[50]  Marc Carreras,et al.  A survey on coverage path planning for robotics , 2013, Robotics Auton. Syst..

[51]  Lukasz Kaiser,et al.  Generating Wikipedia by Summarizing Long Sequences , 2018, ICLR.

[52]  Andrew J. Davison,et al.  Real-time simultaneous localisation and mapping with a single camera , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[53]  Andrea Vedaldi,et al.  MapNet: An Allocentric Spatial Memory for Mapping Environments , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.