Learning Spatial Common Sense With Geometry-Aware Recurrent Networks

We integrate two powerful ideas, geometry and deep visual representation learning, into recurrent network architectures for mobile visual scene understanding. The proposed networks learn to “lift” 2D visual features and integrate them over time into latent 3D feature maps of the scene. They are equipped with differentiable geometric operations, such as projection, unprojection, egomotion estimation, and stabilization, in order to compute a geometrically-consistent mapping between the world scene and their 3D latent feature space. We train the proposed architectures to predict novel image views given short frame sequences as input. Their predictions strongly generalize to scenes with novel numbers of objects, appearances, and configurations, and greatly outperform predictions of previous works that do not consider egomotion stabilization or a space-aware latent feature space. We further train the proposed architectures to detect and segment objects in 3D, using the latent 3D feature map as input—as opposed to 2D feature maps computed from video frames. The resulting detections are permanent: they continue to exist even when an object gets occluded or leaves the field of view. Our experiments suggest the proposed space-aware latent feature arrangement and egomotion-stabilized convolutions are essential architectural choices for spatial common sense to emerge in artificial embodied visual agents.
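The core geometric operation described above, unprojection, lifts each pixel's 2D feature into a camera-centred 3D voxel grid using depth and camera intrinsics, so that subsequent convolutions operate in a space-aware latent arrangement. The sketch below is a minimal NumPy illustration of this idea, not the paper's implementation; the function name, grid size, and metric extent of the voxel cube are assumptions chosen for clarity.

```python
import numpy as np

def unproject_features(feat2d, depth, K, grid_size=8, extent=4.0):
    """Lift a 2D feature map into a 3D voxel feature grid (illustrative sketch).

    feat2d: (H, W, C) image features
    depth:  (H, W) per-pixel depth in metres
    K:      (3, 3) pinhole camera intrinsics
    Returns a (grid_size, grid_size, grid_size, C) voxel grid in which each
    feature is scattered into the voxel containing its back-projected 3D
    point, inside a camera-centred cube of side `extent`.
    """
    H, W, C = feat2d.shape
    vox = np.zeros((grid_size,) * 3 + (C,))
    count = np.zeros((grid_size,) * 3 + (1,))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project every pixel to a 3D point with the pinhole model.
    z = depth
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1)  # (H, W, 3)
    # Shift the cube so it sits in front of the camera, then map points
    # in [-extent/2, extent/2]^3 to integer voxel indices.
    pts[..., 2] -= extent / 2.0
    idx = np.floor((pts / extent + 0.5) * grid_size).astype(int)
    valid = np.all((idx >= 0) & (idx < grid_size), axis=-1)
    for i, j, k, f in zip(idx[valid][:, 0], idx[valid][:, 1],
                          idx[valid][:, 2], feat2d[valid]):
        vox[k, j, i] += f       # accumulate features per voxel
        count[k, j, i] += 1.0
    return vox / np.maximum(count, 1.0)  # mean-pool pixels sharing a voxel
```

In a full architecture, the resulting grid would be rotated and translated by the estimated egomotion before being fused into the recurrent 3D memory, so that convolutions over the memory remain stabilized with respect to the world frame.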
