Learning to Fuse Monocular and Multi-view Cues for Multi-frame Depth Estimation in Dynamic Scenes

Multi-frame depth estimation generally achieves high accuracy by relying on multi-view geometric consistency. When applied to dynamic scenes, e.g., autonomous driving, this consistency is usually violated in dynamic areas, leading to corrupted estimates. Many multi-frame methods handle dynamic areas by identifying them with explicit masks and compensating for the multi-view cues with monocular cues represented as local monocular depth or features. The improvements are limited because the quality of the masks is uncontrolled and the benefits of fusing the two types of cues are underutilized. In this paper, we propose a novel method that learns to fuse the multi-view and monocular cues, encoded as volumes, without needing heuristically crafted masks. As our analyses reveal, the multi-view cues capture more accurate geometric information in static areas, and the monocular cues capture more useful context in dynamic areas. To let the geometric perception learned from multi-view cues in static areas propagate to the monocular representation in dynamic areas, and to let monocular cues enhance the representation of the multi-view cost volume, we propose a cross-cue fusion (CCF) module, which includes cross-cue attention (CCA) to encode the spatially non-local relative intra-relations from each source to enhance the representation of the other. Experiments on real-world datasets demonstrate the effectiveness and generalization ability of the proposed method.
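The cross-cue idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the layer details, so the sketch assumes that each cue is flattened to a list of per-position feature vectors, that "intra-relations" are softmax-normalized dot-product similarities computed within one cue, that these weights re-aggregate the other cue with a residual connection, and that the two enhanced cues are fused by channel concatenation. The function names are hypothetical, and learned projections are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def intra_relations(feats):
    """Pairwise attention weights between all positions of ONE cue.

    feats: list of N feature vectors (one per spatial position).
    Returns an N x N matrix whose rows sum to 1.
    """
    scale = math.sqrt(len(feats[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in feats])
        for q in feats
    ]


def cross_cue_attention(source, target):
    """Enhance `target` using relations computed only from `source`.

    Assumption: a residual connection keeps the original target features.
    """
    attn = intra_relations(source)
    channels = len(target[0])
    enhanced = []
    for weights, original in zip(attn, target):
        # Non-local aggregation of the target cue over all positions.
        agg = [
            sum(w * vec[c] for w, vec in zip(weights, target))
            for c in range(channels)
        ]
        enhanced.append([o + a for o, a in zip(original, agg)])
    return enhanced


def cross_cue_fusion(mono, multi):
    """Apply attention in both directions, then concatenate channels."""
    mono_enh = cross_cue_attention(multi, mono)   # multi-view relations -> mono
    multi_enh = cross_cue_attention(mono, multi)  # mono relations -> multi-view
    return [m + v for m, v in zip(mono_enh, multi_enh)]


# Toy example: 3 positions, 2 channels per cue (values are arbitrary).
mono = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
multi = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = cross_cue_fusion(mono, multi)
```

In this sketch the attention weights never depend on the cue being enhanced, which is one way to read "relations from each source enhance the representation of the other"; a real module would operate on learned volumes and would typically add projection layers before the similarity computation.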
