Self-Supervised Learning of Depth and Camera Motion from 360° Videos

As 360° cameras become prevalent in many autonomous systems (e.g., self-driving cars and drones), efficient 360° perception becomes increasingly important. We propose a novel self-supervised learning approach for predicting omnidirectional depth and camera motion from a 360° video. In particular, starting from SfMLearner, which is designed for cameras with a normal field of view, we introduce three key features to process 360° images efficiently. First, we convert each image from equirectangular projection to cubic projection to avoid image distortion. In each network layer, we use Cube Padding (CP), which pads intermediate features with features from adjacent faces, to avoid artifacts at image boundaries. Second, we propose a novel "spherical" photometric consistency constraint defined on the whole viewing sphere. In this way, no pixel is ever projected outside the image boundary, as typically happens in images with a normal field of view. Finally, rather than naively estimating six independent camera motions (i.e., applying SfMLearner to each face of the cube), we propose a novel camera-pose consistency loss that encourages the six estimated motions to reach a consensus. To train and evaluate our approach, we collect a new dataset, PanoSUNCG, containing a large number of 360° videos with ground-truth depth and camera motion. Our approach achieves state-of-the-art depth prediction and camera motion estimation on PanoSUNCG, with faster inference than processing in equirectangular projection. On real-world indoor videos, our approach also achieves qualitatively reasonable depth prediction using a model pre-trained on PanoSUNCG.
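The abstract only names the components; the PyTorch sketch below is a rough illustration of two of them: (a) Cube Padding restricted to the four equatorial cube faces, where adjacency forms a simple ring and the copied columns need no rotation, and (b) a consensus-style pose consistency penalty. Both function names (`cube_pad_horizontal`, `pose_consistency_loss`) are hypothetical, and both are simplified reconstructions under stated assumptions, not the paper's implementation; in particular, padding the top and bottom faces, and expressing the six per-face motions in one shared frame, require fixed face-to-face rotations omitted here.

```python
import torch

def cube_pad_horizontal(faces: torch.Tensor, pad: int = 1) -> torch.Tensor:
    """Simplified Cube Padding for the equatorial ring of a cubemap.

    faces: (4, C, H, W) feature maps ordered [front, right, back, left],
    so face i's right-hand neighbor in the scene is face (i + 1) % 4 and
    all four faces share a consistent "up" direction (an assumption).
    Returns (4, C, H, W + 2*pad): each face padded with real features
    from its neighbors instead of zeros, removing artificial boundaries.
    """
    left_neighbor = torch.roll(faces, shifts=1, dims=0)    # left_neighbor[i] == faces[i - 1]
    right_neighbor = torch.roll(faces, shifts=-1, dims=0)  # right_neighbor[i] == faces[i + 1]
    left_pad = left_neighbor[..., -pad:]   # rightmost columns of the face to the left
    right_pad = right_neighbor[..., :pad]  # leftmost columns of the face to the right
    return torch.cat([left_pad, faces, right_pad], dim=-1)

def pose_consistency_loss(face_poses: torch.Tensor) -> torch.Tensor:
    """Consensus-style pose consistency penalty (one plausible form).

    face_poses: (6, 6) tensor, one 6-DoF motion (3 rotation + 3 translation)
    per cube face, assumed already rotated into a single shared reference
    frame. Penalizes each face's deviation from the mean (consensus) motion.
    """
    consensus = face_poses.mean(dim=0, keepdim=True)  # (1, 6) average motion
    return (face_poses - consensus).abs().mean()      # L1 deviation from consensus

# Usage sketch: pad 64-channel features on 32x32 equatorial faces by 1 column.
feats = torch.randn(4, 64, 32, 32)
padded = cube_pad_horizontal(feats, pad=1)  # -> (4, 64, 32, 34)
```

Zero padding at a cube-face border would make each face behave like an image with its own artificial boundary; copying the neighbor's columns instead lets convolutions see a seamless signal, which is the intuition behind CP.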

[1] M. Hebert et al. Omni-directional structure from motion. In Proceedings of the IEEE Workshop on Omnidirectional Vision, 2000.

[2] Robert Laganière et al. Orientation and Pose recovery from Spherical Panoramas. In ICCV, 2007.

[3] Didier Stricker et al. Structure from Motion using full spherical panoramic cameras. In ICCV Workshops, 2011.

[4] Daniel Cremers et al. LSD-SLAM: Large-Scale Direct Monocular SLAM. In ECCV, 2014.

[5] Andreas Geiger et al. Omnidirectional 3D reconstruction in augmented Manhattan worlds. In IROS, 2014.

[6] Daniel Cremers et al. Large-scale direct SLAM for omnidirectional cameras. In IROS, 2015.

[7] Nassir Navab et al. Deeper Depth Prediction with Fully Convolutional Residual Networks. In 3DV, 2016.

[8] Gustavo Carneiro et al. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In ECCV, 2016.

[9] John Flynn et al. Deep Stereo: Learning to Predict New Views from the World's Imagery. In CVPR, 2016.

[10] A. Yamashita et al. 3D reconstruction of structures using spherical cameras with small motion. In International Conference on Control, Automation and Systems, 2016.

[11] In-So Kweon et al. All-Around Depth from Small Motion with a Spherical Panoramic Camera. In ECCV, 2016.

[12] Anton Umek et al. Evaluation of Smartphone Inertial Sensor Performance for Cross-Platform Mobile Applications. Sensors, 2016.

[13] Dieter Fox et al. SE3-nets: Learning rigid body motion using deep neural networks. In ICRA, 2017.

[14] Federico Tombari et al. CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction. In CVPR, 2017.

[15] Noah Snavely et al. Unsupervised Learning of Depth and Ego-Motion from Video. In CVPR, 2017.

[16] Oisin Mac Aodha et al. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In CVPR, 2017.

[17] Min Sun et al. Tell Me Where to Look: Investigating Ways for Assisting Focus in 360° Video. In CHI, 2017.

[18] Thomas A. Funkhouser et al. Semantic Scene Completion from a Single Depth Image. In CVPR, 2017.

[19] Kristen Grauman et al. Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery. In NIPS, 2017.

[20] Cordelia Schmid et al. SfM-Net: Learning of Structure and Motion from Video. arXiv preprint, 2017.

[21] Ming-Yu Liu et al. Deep 360 Pilot: Learning a Deep Agent for Piloting through 360° Sports Videos. In CVPR, 2017.

[22] Torsten Sattler et al. 3D visual perception for self-driving cars using a multi-camera system: Calibration, mapping, localization, and obstacle detection. Image and Vision Computing, 2017.

[23] Kristen Grauman et al. Making 360° Video Watchable in 2D: Learning Videography for Click Free Viewing. In CVPR, 2017.

[24] Min Sun et al. Omnidirectional CNN for Visual Place Recognition and Navigation. In ICRA, 2018.

[25] Ming-Hsuan Yang et al. Semantic-Driven Generation of Hyperlapse from 360 Degree Video. IEEE Transactions on Visualization and Computer Graphics, 2018.

[26] Simon Lucey et al. Learning Depth from Monocular Videos Using Direct Methods. In CVPR, 2018.

[27] Anelia Angelova et al. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In CVPR, 2018.

[28] Min Sun et al. Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos. In CVPR, 2018.