DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion

We propose an online multi-view depth prediction approach on posed video streams, where the scene geometry information computed in previous time steps is propagated to the current time step in an efficient and geometrically plausible way. The backbone of our approach is a real-time capable, lightweight encoder-decoder that relies on cost volumes computed from pairs of images. We extend it by placing a ConvLSTM cell at the bottleneck layer, which compresses an arbitrary amount of past information in its states. The novelty lies in propagating the hidden state of the cell by accounting for the viewpoint changes between time steps. At a given time step, we warp the previous hidden state into the current camera plane using the previous depth prediction. Our extension adds only a small overhead in computation time and memory consumption while significantly improving the depth predictions. As a result, we outperform the existing state-of-the-art multi-view stereo methods on most of the evaluated metrics on hundreds of indoor scenes while maintaining real-time performance. Code is available at https://github.com/ardaduz/deep-video-mvs
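The hidden-state warping step described above can be illustrated with a small NumPy sketch. It is not taken from the released code: the function name `warp_hidden_state`, the pinhole intrinsics `K`, the pose convention `T_cur_from_prev`, and the choice of nearest-neighbor forward splatting with a z-buffer are all assumptions made for illustration, and the actual implementation may warp differently (for example with bilinear sampling).

```python
import numpy as np

def warp_hidden_state(hidden_prev, depth_prev, K, T_cur_from_prev):
    """Forward-warp a (C, H, W) hidden state into the current view (hypothetical helper).

    hidden_prev:     (C, H, W) hidden state from the previous time step
    depth_prev:      (H, W) depth predicted at the previous time step,
                     assumed to be at the hidden-state resolution
    K:               (3, 3) pinhole intrinsics at that resolution (assumed)
    T_cur_from_prev: (4, 4) rigid transform, previous camera -> current camera
    """
    C, H, W = hidden_prev.shape
    n = H * W
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(n)], axis=0).astype(float)

    # Unproject previous pixels to 3D using the previous depth prediction.
    d = depth_prev.ravel().astype(float)
    pts_prev = (np.linalg.inv(K) @ pix) * d

    # Move the points into the current camera frame and project them.
    R, t = T_cur_from_prev[:3, :3], T_cur_from_prev[:3, 3:4]
    pts_cur = R @ pts_prev + t
    z = pts_cur[2]
    front = (z > 1e-6) & (d > 0)
    safe_z = np.where(front, z, 1.0)  # avoid dividing by zero for invalid points
    proj = K @ pts_cur
    u_cur = np.rint(proj[0] / safe_z).astype(int)
    v_cur = np.rint(proj[1] / safe_z).astype(int)
    inside = front & (u_cur >= 0) & (u_cur < W) & (v_cur >= 0) & (v_cur < H)

    # Z-buffer: sort near-to-far and keep the first (nearest) source per target pixel.
    src = np.flatnonzero(inside)
    src = src[np.argsort(z[src], kind="stable")]
    tgt = v_cur[src] * W + u_cur[src]
    tgt_unique, first = np.unique(tgt, return_index=True)

    # Pixels that receive no point stay zero (disoccluded or out of view).
    warped = np.zeros((C, n), dtype=hidden_prev.dtype)
    warped[:, tgt_unique] = hidden_prev.reshape(C, n)[:, src[first]]
    return warped.reshape(C, H, W)
```

In a recurrent loop, the warped state would be passed to the ConvLSTM cell in place of the raw previous hidden state. Target pixels that no source point reaches are left at zero here, which is one simple way to handle disocclusions and not necessarily the authors' choice.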
