Self-Supervised Structure-from-Motion through Tightly-Coupled Depth and Egomotion Networks

Much recent literature has formulated structure-from-motion (SfM) as a self-supervised learning problem where the goal is to jointly learn neural network models of depth and egomotion through view synthesis. Herein, we address the open problem of how to optimally couple the depth and egomotion network components. Toward this end, we introduce several notions of coupling, categorize existing approaches, and present a novel tightly-coupled approach that leverages the interdependence of depth and egomotion at training and at inference time. Our approach uses iterative view synthesis to recursively update the egomotion network input, permitting contextual information to be passed between the components without explicit weight sharing. Through substantial experiments, we demonstrate that our approach promotes consistency between the depth and egomotion predictions at test time, improves generalization on new data, and leads to state-of-the-art accuracy on indoor and outdoor depth and egomotion evaluation benchmarks.
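The iterative idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, a per-pixel depth map for the target frame, and a hypothetical callable `pose_step` that stands in for the egomotion network and returns a 4x4 pose correction. In a real system `pose_step` would take the target image and the source image resampled at the reprojected coordinates; here it receives only the coordinates so the loop stays self-contained. Composing each correction onto the running estimate is likewise an assumption about how the recursive update could be realized.

```python
import numpy as np

def reproject(depth, K, T):
    """Map every target pixel to source-image coordinates.

    depth: (h, w) target depth map; K: 3x3 intrinsics;
    T: 4x4 target-to-source rigid transform. Returns a (2, h, w) array.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)  # back-project to 3D
    pts = T[:3, :3] @ pts + T[:3, 3:4]                     # move into source frame
    proj = K @ pts                                         # project with the pinhole model
    return (proj[:2] / proj[2:3]).reshape(2, h, w)

def refine_pose(pose_step, depth, K, n_iters=3):
    """Iteratively refine the pose estimate (hypothetical helper).

    Each pass re-synthesizes the view with the current estimate, hands the
    result to pose_step, and composes the returned correction on the left.
    """
    T = np.eye(4)
    for _ in range(n_iters):
        coords = reproject(depth, K, T)  # view synthesis with the current pose
        T = pose_step(coords) @ T        # compose the predicted correction
    return T
```

Because the depth map enters every reprojection, each pose correction is conditioned on the current depth prediction, which is one way to pass information between the two components without sharing weights.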
