Visual Representation Learning with 3D View-Contrastive Inverse Graphics Networks

Predictive coding theories suggest that the brain learns by predicting observations at various levels of abstraction. One of the most basic prediction tasks is view prediction: how would a given scene look from an alternative viewpoint? Humans excel at this task. Our ability to imagine and fill in missing visual information is tightly coupled with perception: we feel as if we see the world in 3 dimensions, while in fact, information from only the front surface of the world hits our (2D) retinas. This paper explores the connection between view-predictive representation learning and its role in the development of 3D visual recognition. We propose inverse graphics networks, which take as input 2.5D video streams captured by a moving camera, and map to stable 3D feature maps of the scene, by disentangling the scene content from the motion of the camera. The model can also project its 3D feature maps to novel viewpoints, to predict and match against target views. We propose contrastive prediction losses that can handle stochasticity of the visual input and can scale view-predictive learning to more photorealistic scenes than those considered in previous works. We show that the proposed model learns 3D visual representations useful for (1) semi-supervised learning of 3D object detectors, and (2) unsupervised learning of 3D moving object detectors, by estimating motion of the inferred 3D feature maps in videos of dynamic scenes. To the best of our knowledge, this is the first work that empirically shows view prediction to be a useful and scalable self-supervised task beneficial to 3D object detection.

[1]  Germán Ros,et al.  CARLA: An Open Urban Driving Simulator , 2017, CoRL.

[2]  Andrew W. Fitzgibbon,et al.  SphereFlow: 6 DoF Scene Flow from RGB-D Pairs , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Floris P. de Lange,et al.  Expectations accelerate entry of visual stimuli into awareness. , 2015, Journal of vision.

[4]  Jan Kautz,et al.  PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[6]  Silvio Savarese,et al.  Deep Metric Learning via Lifted Structured Feature Embedding , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Rodney A. Brooks,et al.  Intelligence Without Reason , 1991, IJCAI.

[8]  Gordon Wetzstein,et al.  DeepVoxels: Learning Persistent 3D Feature Embeddings , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Juan D. Tardós,et al.  ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras , 2016, IEEE Transactions on Robotics.

[11]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Sergey Levine,et al.  End-to-End Training of Deep Visuomotor Policies , 2015, J. Mach. Learn. Res..

[13]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[14]  Berthold K. P. Horn,et al.  Determining Optical Flow , 1981, Other Conferences.

[15]  M. Wibral,et al.  The Faces of Predictive Coding , 2015, The Journal of Neuroscience.

[16]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[17]  Nuno Vasconcelos,et al.  Spatiotemporal Saliency in Dynamic Scenes , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  A. Patla Visual control of human locomotion. , 1991 .

[19]  Konstantinos G. Derpanis,et al.  Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness , 2016, ECCV Workshops.

[20]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[21]  Jitendra Malik,et al.  Learning a Multi-View Stereo Machine , 2017, NIPS.

[22]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  J. Gibson The Ecological Approach to Visual Perception , 1979 .

[24]  M. Garrett,et al.  Perceptual knowledge of objects in infancy , 1982 .

[25]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[26]  Rahul Sukthankar,et al.  Cognitive Mapping and Planning for Visual Navigation , 2017, International Journal of Computer Vision.

[27]  Jitendra Malik,et al.  Learning to segment moving objects in videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Katerina Fragkiadaki,et al.  Learning Spatial Common Sense With Geometry-Aware Recurrent Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Juan Carlos Niebles,et al.  Learning to Decompose and Disentangle Representations for Video Prediction , 2018, NeurIPS.

[31]  Hao Li,et al.  PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  C Loehlin John,et al.  Latent variable models: an introduction to factor, path, and structural analysis , 1986 .

[33]  Scott P. Johnson,et al.  Development of three-dimensional object completion in infancy. , 2008, Child development.

[34]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[35]  Lawrence G. Roberts,et al.  Machine Perception of Three-Dimensional Solids , 1963, Outstanding Dissertations in the Computer Sciences.

[36]  Rajesh P. N. Rao,et al.  Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. , 1999 .

[37]  Koray Kavukcuoglu,et al.  Neural scene representation and rendering , 2018, Science.

[38]  Thomas Brox,et al.  Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions , 2011, 2011 International Conference on Computer Vision.

[39]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[40]  Michael J. Black,et al.  Secrets of optical flow estimation and their principles , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[41]  Simon Lucey,et al.  Inverse Compositional Spatial Transformer Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Thomas Brox,et al.  Single-view to Multi-view: Reconstructing Unseen Views with a Convolutional Network , 2015, ArXiv.

[43]  Karl J. Friston Learning and inference in the brain , 2003, Neural Networks.

[44]  James L. McClelland,et al.  An interactive activation model of context effects in letter perception: I. An account of basic findings. , 1981 .

[45]  Yee Whye Teh,et al.  Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects , 2018, NeurIPS.

[46]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[47]  Thomas Brox,et al.  Lucid Data Dreaming for Object Tracking , 2017, ArXiv.

[48]  Nuno Vasconcelos,et al.  On the plausibility of the discriminant center-surround hypothesis for visual saliency. , 2008, Journal of vision.

[49]  Simon Baker,et al.  Lucas-Kanade 20 Years On: A Unifying Framework , 2004, International Journal of Computer Vision.

[50]  Andrea Vedaldi,et al.  MapNet: An Allocentric Spatial Memory for Mapping Environments , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[52]  Alexander J. Smola,et al.  Sampling Matters in Deep Embedding Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[53]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Peter Dayan,et al.  A Neural Substrate of Prediction and Reward , 1997, Science.

[55]  Kihyuk Sohn,et al.  Improved Deep Metric Learning with Multi-class N-pair Loss Objective , 2016, NIPS.

[56]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).