Geometry-Aware Recurrent Neural Networks for Active Visual Recognition

We present recurrent geometry-aware neural networks that integrate visual information across multiple views of a scene into 3D latent feature tensors while maintaining a one-to-one mapping between 3D physical locations in the world scene and latent feature locations. Object detection, object segmentation, and 3D reconstruction are then carried out directly on the constructed 3D feature memory rather than on any of the input 2D images. The proposed models are equipped with differentiable egomotion-aware feature warping and (learned) depth-aware unprojection operations to achieve a geometrically consistent mapping between the features in the input frame and the constructed latent model of the scene. We empirically show that the proposed model generalizes much better than geometry-unaware LSTM/GRU networks, especially in the presence of multiple objects and cross-object occlusions. Combined with active view selection policies, our model learns to select informative viewpoints from which to integrate information by “undoing” cross-object occlusions, seamlessly combining geometry with learning from experience.
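The depth-aware unprojection and egomotion-aware alignment mentioned above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a given metric depth map rather than a learned one, a known rigid camera-to-memory transform (`R`, `t`), and simple per-voxel averaging in place of a differentiable warp and a recurrent update. The function name `unproject_to_voxels` and all of its parameters are hypothetical.

```python
import numpy as np

def unproject_to_voxels(feat, depth, fx, fy, cx, cy, grid_size, bounds,
                        R=None, t=None):
    """Lift a 2D feature map into a 3D voxel grid (illustrative sketch).

    feat:   (H, W, C) feature map.
    depth:  (H, W) depth along the camera z axis (assumed given here).
    bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)) of the memory grid.
    R, t:   optional rigid transform from camera frame to memory frame,
            standing in for egomotion-aware alignment.
    Returns a (G, G, G, C) grid holding the mean feature per voxel.
    """
    H, W, C = feat.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Pinhole back-projection: pixel (u, v) with depth z -> 3D point.
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    in_front = pts[:, 2] > 0

    # Move the points into the shared memory frame (assumed known egomotion).
    if R is not None:
        pts = pts @ np.asarray(R, dtype=np.float64).T
    if t is not None:
        pts = pts + np.asarray(t, dtype=np.float64)

    # Map metric coordinates to integer voxel indices.
    lo = np.array([b[0] for b in bounds], dtype=np.float64)
    hi = np.array([b[1] for b in bounds], dtype=np.float64)
    idx = np.floor((pts - lo) / (hi - lo) * grid_size).astype(int)
    valid = in_front & np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx = idx[valid]
    f = feat.reshape(-1, C)[valid]

    # Accumulate features per voxel, then average where occupied.
    grid = np.zeros((grid_size,) * 3 + (C,))
    count = np.zeros((grid_size,) * 3)
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), f)
    np.add.at(count, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    occupied = count > 0
    grid[occupied] /= count[occupied][:, None]
    return grid

# Usage: a 4x4 feature map at constant depth fills one slab of the grid.
feat = np.ones((4, 4, 3))
depth = np.full((4, 4), 2.0)
bounds = ((-2.0, 2.0), (-2.0, 2.0), (0.0, 4.0))
grid = unproject_to_voxels(feat, depth, 4.0, 4.0, 2.0, 2.0, 4, bounds)
shifted = unproject_to_voxels(feat, depth, 4.0, 4.0, 2.0, 2.0, 4, bounds,
                              R=np.eye(3), t=[1.0, 0.0, 0.0])
```

Because every view is written into the same metric grid after applying its own `R` and `t`, a physical 3D location always maps to the same voxel, which is the one-to-one correspondence the abstract describes. Transforming points before voxelization is a simplification; warping an already-built feature tensor would instead require resampling (for example, trilinear interpolation).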
