论文信息 - Robust Consistent Video Depth Estimation

Robust Consistent Video Depth Estimation

We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video. We integrate a learning-based depth prior, in the form of a convolutional neural network trained for single-image depth estimation, with geometric optimization, to estimate a smooth camera trajectory as well as detailed and stable depth reconstruction. Our algorithm combines two complementary techniques: (1) flexible deformation-splines for low-frequency large-scale alignment and (2) geometry-aware depth filtering for high-frequency alignment of fine depth details. In contrast to prior approaches, our method does not require camera poses as input and achieves robust reconstruction for challenging hand-held cell phone captures containing a significant amount of noise, shake, motion blur, and rolling shutter deformations. Our method quantitatively outperforms state-of-the-arts on the Sintel benchmark for both depth and pose estimations and attains favorable qualitative results across diverse wild datasets.

Johannes Kopf | Xuejian Rong | Jia-Bin Huang

[1] Jia-Bin Huang,et al. DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency , 2018, ECCV.

[2] Luc Van Gool,et al. The 2017 DAVIS Challenge on Video Object Segmentation , 2017, ArXiv.

[3] Zhengqi Li,et al. MegaDepth: Learning Single-View Depth Prediction from Internet Photos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4] Gabriel J. Brostow,et al. Self-Supervised Monocular Depth Hints , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[5] Sylvain Paris,et al. Blind video temporal consistency , 2015, ACM Trans. Graph..

[6] Hongbin Zha,et al. Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] James R. Bergen,et al. Visual odometry , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[8] Markus Gross,et al. Practical temporal consistency for image-based graphics applications , 2012, ACM Trans. Graph..

[9] Michael Gleicher,et al. Content-preserving warps for 3D video stabilization , 2009, ACM Trans. Graph..

[10] Ce Liu,et al. Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] Konrad Schindler,et al. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Chao Liu,et al. Neural RGB®D Sensing: Depth and Uncertainty From a Video Camera , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Friedrich Fraundorfer,et al. Visual Odometry Part I: The First 30 Years and Fundamentals , 2022 .

[14] Dengxin Dai,et al. Don’t Forget The Past: Recurrent Depth Estimation from Monocular Video , 2020, IEEE Robotics and Automation Letters.

[15] Oisin Mac Aodha,et al. Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Daniel Cremers,et al. Challenges in Monocular Visual Odometry: Photometric Calibration, Motion Bias, and Rolling Shutter Effect , 2017, IEEE Robotics and Automation Letters.

[17] Narendra Ahuja,et al. DeepMVS: Learning Multi-view Stereopsis , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18] Dacheng Tao,et al. Deep Ordinal Regression Network for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19] Zhili Chen,et al. 6-DOF VR videos with a single 360-camera , 2017, 2017 IEEE Virtual Reality (VR).

[20] Vladlen Koltun,et al. Dense Monocular Depth Estimation in Complex Dynamic Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Narendra Ahuja,et al. Temporally coherent completion of dynamic video , 2016, ACM Trans. Graph..

[22] Richard Szeliski,et al. Consistent video depth estimation , 2020, ACM Trans. Graph..

[23] Daniel Cremers,et al. Direct Sparse Odometry , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24] Nassir Navab,et al. Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[25] William T. Freeman,et al. Learning the Depths of Moving People by Watching Frozen People , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Michael J. Black,et al. Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Gabriel J. Brostow,et al. Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28] Simon Lucey,et al. Web Stereo Video Supervision for Depth Prediction from Dynamic Scenes , 2019, 2019 International Conference on 3D Vision (3DV).

[29] Anelia Angelova,et al. Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos , 2018, AAAI.

[30] Stephen Lin,et al. DPSNet: End-to-end Deep Plane Sweep Stereo , 2019, ICLR.

[31] Rob Fergus,et al. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[32] Jia Deng,et al. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow , 2020, ECCV.

[33] Thomas Brox,et al. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Juan D. Tardós,et al. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras , 2016, IEEE Transactions on Robotics.

[35] Jia Deng,et al. OASIS: A Large-Scale Dataset for Single Image 3D in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Ian D. Reid,et al. Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37] Stefan Leutenegger,et al. CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38] Noah Snavely,et al. Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Wolfram Burgard,et al. A benchmark for the evaluation of RGB-D SLAM systems , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[40] Renjie Liao,et al. GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[41] Jan-Michael Frahm,et al. Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Carlos Hernandez,et al. Multi-View Stereo: A Tutorial , 2015, Found. Trends Comput. Graph. Vis..

[43] Jan Kautz,et al. Novel View Synthesis of Dynamic Scenes With Globally Coherent Depths From a Monocular Camera , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Ersin Yumer,et al. Learning Blind Video Temporal Consistency , 2018, ECCV.

[45] Weifeng Chen,et al. Learning Single-Image Depth From Videos Using Quality Assessment Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Luc Van Gool,et al. Self-supervised Object Motion and Depth Estimation from Video , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[47] Richard Szeliski,et al. A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[48] Long Quan,et al. MVSNet: Depth Inference for Unstructured Multi-view Stereo , 2018, ECCV.

[49] Anelia Angelova,et al. Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[50] Dongbing Gu,et al. UnDeepVO: Monocular Visual Odometry Through Unsupervised Deep Learning , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[51] Nan Yang,et al. D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Cordelia Schmid,et al. SfM-Net: Learning of Structure and Motion from Video , 2017, ArXiv.

[53] Daniel Cremers,et al. LSD-SLAM: Large-Scale Direct Monocular SLAM , 2014, ECCV.

[54] Weifeng Chen,et al. Single-Image Depth Perception in the Wild , 2016, NIPS.

[55] Jia-Bin Huang,et al. Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling , 2020, ECCV.

[56] Thomas Brox,et al. DeMoN: Depth and Motion Network for Learning Monocular Stereo , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Sen Wang,et al. DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[58] Hao Su,et al. Normal Assisted Stereo Depth Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Dan Xu,et al. Unsupervised Collaborative Learning of Keyframe Detection and Visual Odometry Towards Monocular Deep SLAM , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[60] Andrew J. Davison,et al. DTAM: Dense tracking and mapping in real-time , 2011, 2011 International Conference on Computer Vision.

[61] Michael J. Black,et al. A Naturalistic Open Source Movie for Optical Flow Evaluation , 2012, ECCV.

[62] Thomas Brox,et al. DeepTAM: Deep Tracking and Mapping , 2018, ECCV.

[63] Luc Van Gool,et al. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64] Ian D. Reid,et al. Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[65] Zhichao Yin,et al. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66] Xiaogang Wang,et al. Learning Monocular Depth by Distilling Cross-domain Stereo Networks , 2018, ECCV.

[67] Jia Deng,et al. DeepV2D: Video to Depth with Differentiable Structure from Motion , 2018, ICLR.

[68] Nenghai Yu,et al. Coherent Online Video Style Transfer , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[69] Matthias Nießner,et al. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70] Rob Fergus,et al. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[71] Ganesh Iyer,et al. ∇SLAM: Dense SLAM meets Automatic Differentiation , 2019, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[72] Jonathan T. Barron,et al. Depth from motion for smartphone AR , 2018, ACM Trans. Graph..

[73] Hang Zhao,et al. Unsupervised Monocular Depth Learning in Dynamic Scenes , 2020, CoRL.