论文信息 - DeepV2D: Video to Depth with Differentiable Structure from Motion

DeepV2D: Video to Depth with Differentiable Structure from Motion

We propose DeepV2D, an end-to-end deep learning architecture for predicting depth from video. DeepV2D combines the representation ability of neural networks with the geometric principles governing image formation. We compose a collection of classical geometric algorithms, which are converted into trainable modules and combined into an end-to-end differentiable architecture. DeepV2D interleaves two stages: motion estimation and depth estimation. During inference, motion and depth estimation are alternated and converge to accurate depth. Code is available this https URL.

Jia Deng | Zachary Teed | Jia Deng | Zachary Teed

[1] Jan-Michael Frahm,et al. Reconstructing the World* in Six Days *(As Captured by the Yahoo 100 Million Image Dataset) , 2015, CVPR 2015.

[2] Cordelia Schmid,et al. SfM-Net: Learning of Structure and Motion from Video , 2017, ArXiv.

[3] Daniel Cremers,et al. LSD-SLAM: Large-Scale Direct Monocular SLAM , 2014, ECCV.

[4] Nassir Navab,et al. Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[5] Daniel Cremers,et al. Direct Sparse Odometry , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6] Rob Fergus,et al. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[7] Thomas Brox,et al. DeepTAM: Deep Tracking and Mapping , 2018, ECCV.

[8] Ping Tan,et al. BA-Net: Dense Bundle Adjustment Network , 2018, ICLR 2018.

[9] Anelia Angelova,et al. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10] H. C. Longuet-Higgins,et al. A computer algorithm for reconstructing a scene from two projections , 1981, Nature.

[11] Andrea Fusiello,et al. Improving the efficiency of hierarchical structure-and-motion , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12] Shiqi Li,et al. A Robust O(n) Solution to the Perspective-n-Point Problem , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13] In-So Kweon,et al. High-Quality Depth from Uncalibrated Small Motion Clip , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Sen Wang,et al. DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[15] Roberto Cipolla,et al. Geometric Loss Functions for Camera Pose Regression with Deep Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] J. M. M. Montiel,et al. ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[17] Noah Snavely,et al. Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Andrew J. Davison,et al. DTAM: Dense tracking and mapping in real-time , 2011, 2011 International Conference on Computer Vision.

[19] V. Lepetit,et al. EPnP: An Accurate O(n) Solution to the PnP Problem , 2009, International Journal of Computer Vision.

[20] Jean Ponce,et al. Accurate, Dense, and Robust Multiview Stereopsis , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21] Andreas Geiger,et al. Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[22] Rahul Sukthankar,et al. MatchNet: Unifying feature and metric learning for patch-based matching , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Richard Szeliski,et al. Building Rome in a day , 2009, ICCV.

[24] Andrew Zisserman,et al. Spatial Transformer Networks , 2015, NIPS.

[25] Thomas Brox,et al. FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[26] Wolfram Burgard,et al. Deep Auxiliary Learning for Visual Localization and Odometry , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[27] Thomas Brox,et al. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Long Quan,et al. Relative 3D Reconstruction Using Multiple Uncalibrated Images , 1995, Int. J. Robotics Res..

[29] Jan Kautz,et al. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30] Hauke Strasdat,et al. Scale Drift-Aware Large Scale Monocular SLAM , 2010, Robotics: Science and Systems.

[31] Jörg Stückler,et al. Semi-Supervised Deep Learning for Monocular Depth Map Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Andrew Owens,et al. SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels , 2013, 2013 IEEE International Conference on Computer Vision.

[33] Peter Wonka,et al. High Quality Monocular Depth Estimation via Transfer Learning , 2018, ArXiv.

[34] Carlos Hernandez,et al. Multi-View Stereo: A Tutorial , 2015, Found. Trends Comput. Graph. Vis..

[35] Yong-Sheng Chen,et al. Pyramid Stereo Matching Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36] Oisin Mac Aodha,et al. Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37] Stefan Leutenegger,et al. LS-Net: Learning to Solve Nonlinear Least Squares for Monocular Stereo , 2018, ECCV.

[38] Jörg Stückler,et al. Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry , 2018, ECCV.

[39] Matthias Nießner,et al. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Daniel Cremers,et al. Dense visual SLAM for RGB-D cameras , 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[41] Noah Snavely,et al. Scene Reconstruction and Visualization from Internet Photo Collections: A Survey , 2011, IPSJ Trans. Comput. Vis. Appl..

[42] Zhichao Yin,et al. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43] Yuan Yu,et al. TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[44] Stefan Leutenegger,et al. CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45] Derek Hoiem,et al. Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[46] Wolfram Burgard,et al. A benchmark for the evaluation of RGB-D SLAM systems , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[47] Juan D. Tardós,et al. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras , 2016, IEEE Transactions on Robotics.

[48] Roberto Cipolla,et al. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[49] Tianqi Chen,et al. Training Deep Nets with Sublinear Memory Cost , 2016, ArXiv.

[50] Jitendra Malik,et al. Learning a Multi-View Stereo Machine , 2017, NIPS.

[51] Simon Lucey,et al. Learning Depth from Monocular Videos Using Direct Methods , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52] Jia Deng,et al. Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[53] Alex Kendall,et al. End-to-End Learning of Geometry and Context for Deep Stereo Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54] G LoweDavid,et al. Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[55] Rob Fergus,et al. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[56] Dacheng Tao,et al. Deep Ordinal Regression Network for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57] Jan-Michael Frahm,et al. Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58] Long Quan,et al. MVSNet: Depth Inference for Unstructured Multi-view Stereo , 2018, ECCV.

[59] H. Hirschmüller. Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information , 2005, CVPR.

[60] Thomas Brox,et al. DeMoN: Depth and Motion Network for Learning Monocular Stereo , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61] Vladlen Koltun,et al. Dense Monocular Depth Estimation in Complex Dynamic Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).