MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera

In this paper, we propose MonoRec, a semi-supervised monocular dense reconstruction architecture that predicts depth maps from a single moving camera in dynamic environments. MonoRec is based on a multi-view stereo setting which encodes the information of multiple consecutive images in a cost volume. To deal with dynamic objects in the scene, we introduce a MaskModule that predicts moving object masks by leveraging the photometric inconsistencies encoded in the cost volumes. Unlike other multi-view stereo methods, MonoRec is able to reconstruct both static and moving objects by leveraging the predicted masks. Furthermore, we present a novel multi-stage training scheme with a semi-supervised loss formulation that does not require LiDAR depth values. We carefully evaluate MonoRec on the KITTI dataset and show that it achieves state-of-theart performance compared to both multi-view and singleview methods. With the model trained on KITTI, we furthermore demonstrate that MonoRec is able to generalize well to both the Oxford RobotCar dataset and the more challenging TUM-Mono dataset recorded by a handheld camera. Code and related materials are available at https://vision.in.tum.de/research/monorec.

[1]  Oisin Mac Aodha,et al.  The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Javier Civera,et al.  Information-Driven Direct RGB-D Odometry , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Richard Szeliski,et al.  Consistent video depth estimation , 2020, ACM Trans. Graph..

[4]  Zehao Yu,et al.  Fast-MVSNet: Sparse-to-Dense Multi-View Stereo With Learned Propagation and Gauss-Newton Refinement , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Vijay Badrinarayanan,et al.  Atlas: End-to-End 3D Scene Reconstruction from Posed Images , 2020, ECCV.

[6]  Nan Yang,et al.  D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Andrew J. Davison,et al.  DeepFactors: Real-Time Probabilistic Dense Monocular SLAM , 2020, IEEE Robotics and Automation Letters.

[8]  J. Álvarez,et al.  Cost Volume Pyramid Based Depth Inference for Multi-View Stereo , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Siyu Zhu,et al.  Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[11]  Tao Guan,et al.  P-MVSNet: Learning Patch-Wise Matching Confidence Aggregation for Multi-View Stereo , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Jiansheng Chen,et al.  MVSCRF: Learning Multi-View Stereo With Conditional Random Fields , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Jing Xu,et al.  Point-Based Multi-View Stereo Network , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Tom van Dijk,et al.  How Do Neural Networks See Depth in Single Images? , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Rares Ambrus,et al.  3D Packing for Self-Supervised Monocular Depth Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Stephen Lin,et al.  DPSNet: End-to-end Deep Plane Sweep Stereo , 2019, ICLR.

[17]  William T. Freeman,et al.  Learning the Depths of Moving People by Watching Frozen People , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Jörg Stückler,et al.  Visual-Inertial Mapping With Non-Linear Factor Recovery , 2019, IEEE Robotics and Automation Letters.

[19]  Yuxin Hou,et al.  Multi-View Stereo by Temporal Nonparametric Fusion , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Anelia Angelova,et al.  Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Matteo Matteucci,et al.  TAPA-MVS: Textureless-Aware PAtchMatch Multi-View Stereo , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Long Quan,et al.  Recurrent MVSNet for High-Resolution Multi-View Stereo Depth Inference , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Thomas Brox,et al.  DeepTAM: Deep Tracking and Mapping , 2018, ECCV.

[24]  Shaojie Shen,et al.  MVDepthNet: Real-Time Multiview Depth Estimation Neural Network , 2018, 2018 International Conference on 3D Vision (3DV).

[25]  Jörg Stückler,et al.  Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry , 2018, ECCV.

[26]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Dacheng Tao,et al.  Deep Ordinal Regression Network for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Long Quan,et al.  MVSNet: Depth Inference for Unstructured Multi-view Stereo , 2018, ECCV.

[29]  Stefan Leutenegger,et al.  CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Narendra Ahuja,et al.  DeepMVS: Learning Multi-view Stereopsis , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Ian D. Reid,et al.  Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Zhichao Yin,et al.  GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Long Quan,et al.  Relative Camera Refinement for Accurate Dense Reconstruction , 2017, 2017 International Conference on 3D Vision (3DV).

[34]  Thomas Brox,et al.  Sparsity Invariant CNNs , 2017, 2017 International Conference on 3D Vision (3DV).

[35]  Lu Fang,et al.  SurfaceNet: An End-to-End 3D Neural Network for Multiview Stereopsis , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Jitendra Malik,et al.  Learning a Multi-View Stereo Machine , 2017, NIPS.

[37]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Federico Tombari,et al.  CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Paul Newman,et al.  1 year, 1000 km: The Oxford RobotCar dataset , 2017, Int. J. Robotics Res..

[40]  Jan-Michael Frahm,et al.  Pixelwise View Selection for Unstructured Multi-View Stereo , 2016, ECCV.

[41]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Daniel Cremers,et al.  Direct Sparse Odometry , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Daniel Cremers,et al.  A Photometrically Calibrated Benchmark For Monocular Visual Odometry , 2016, ArXiv.

[44]  Vladlen Koltun,et al.  Dense Monocular Depth Estimation in Complex Dynamic Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[46]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Konrad Schindler,et al.  Massively Parallel Multiview Stereopsis by Surface Normal Diffusion , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[49]  Chunhua Shen,et al.  Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[51]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[52]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[53]  Davide Scaramuzza,et al.  REMODE: Probabilistic, monocular dense reconstruction in real time , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[54]  Rui Yu,et al.  Video Pop-up: Monocular 3D Reconstruction of Dynamic Scenes , 2014, ECCV.

[55]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[56]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[57]  Wolfram Burgard,et al.  A benchmark for the evaluation of RGB-D SLAM systems , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[58]  C. Strecha,et al.  Efficient large-scale multi-view stereo for ultra high-resolution image sets , 2012, Machine Vision and Applications.

[59]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Andrew J. Davison,et al.  DTAM: Dense tracking and mapping in real-time , 2011, 2011 International Conference on Computer Vision.

[61]  Daniel Cremers,et al.  Real-Time Dense Geometry from a Handheld Camera , 2010, DAGM-Symposium.

[62]  Jean Ponce,et al.  Accurate, Dense, and Robust Multiview Stereopsis , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63]  Roberto Cipolla,et al.  Using Multiple Hypotheses to Improve Depth-Maps for Multi-View Stereo , 2008, ECCV.

[64]  Long Quan,et al.  A quasi-dense approach to surface reconstruction from uncalibrated images , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[65]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[66]  Kiriakos N. Kutulakos,et al.  A Theory of Shape by Space Carving , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[67]  S. Seitz,et al.  Photorealistic Scene Reconstruction by Voxel Coloring , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.