Spherical View Synthesis for Self-Supervised 360° Depth Estimation

Learning-based approaches for depth perception are limited by the availability of clean training data. This has led to the use of view synthesis as an indirect objective for learning depth estimation, since it relies on data that can be acquired efficiently. Nonetheless, most research focuses on pinhole-based monocular vision, and few works present results for omnidirectional input. In this work, we explore spherical view synthesis for learning monocular 360° depth in a self-supervised manner and demonstrate its feasibility. Under a purely geometrically derived formulation, we present results for horizontal and vertical baselines, as well as for the trinocular case. Further, we show how to better exploit the expressiveness of traditional CNNs when they are applied to the equirectangular domain in an efficient manner. Finally, given the availability of ground truth depth data, our work is uniquely positioned to compare view synthesis against direct supervision in a consistent and fair manner. The results indicate that alternative research directions might be better suited to enable higher-quality depth perception. Our data, models and code are publicly available at https://vcl3d.github.io/SphericalViewSynthesis/.
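
As a rough illustration of the geometry behind spherical view synthesis, the sketch below reprojects an equirectangular depth map into a second camera displaced by a pure translation. It is not the authors' implementation: the function names, the pixel-centre convention, and the axis convention (y up, z forward, longitude measured from +z towards +x) are all assumptions made for this example.

```python
import numpy as np

def pixel_to_angles(u, v, width, height):
    """Map equirectangular pixel centres to (longitude, latitude) in radians.
    Assumed convention: longitude in [-pi, pi), latitude +pi/2 at the top row."""
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    return lon, lat

def angles_to_points(lon, lat, depth):
    """Lift spherical angles and radial depth to 3D points (y up, z forward)."""
    x = depth * np.cos(lat) * np.sin(lon)
    y = depth * np.sin(lat)
    z = depth * np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

def points_to_pixel(points, width, height):
    """Project 3D points back to continuous equirectangular pixel coordinates."""
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    r = np.linalg.norm(points, axis=-1)
    lon = np.arctan2(x, z)
    lat = np.arcsin(np.clip(y / r, -1.0, 1.0))
    u = (lon + np.pi) / (2.0 * np.pi) * width - 0.5
    v = (np.pi / 2.0 - lat) / np.pi * height - 0.5
    return u, v

def reproject(depth, translation):
    """Return, for each source pixel, its (u, v) location in a camera
    translated by `translation` (no rotation) relative to the source."""
    height, width = depth.shape
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    lon, lat = pixel_to_angles(u, v, width, height)
    points = angles_to_points(lon, lat, depth)
    return points_to_pixel(points - np.asarray(translation, dtype=float), width, height)
```

Under these assumed conventions, a vertical baseline (translation along y) leaves every longitude unchanged, so pixels move only along image columns, whereas a horizontal baseline generally changes both longitude and latitude. The returned coordinates could then drive a bilinear sampler to synthesize the target view.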
