STS: Surround-view Temporal Stereo for Multi-view 3D Detection

Learning accurate depth is essential to multi-view 3D object detection. Recent approaches mainly learn depth from monocular images, which confront inherent difficulties due to the ill-posed nature of monocular depth learning. Instead of using a sole monocular depth method, in this work, we propose a novel Surround-view Temporal Stereo (STS) technique that leverages the geometry correspondence between frames across time to facilitate accurate depth learning. Specifically, we regard the field of views from all cameras around the ego vehicle as a unified view, namely surroundview, and conduct temporal stereo matching on it. The resulting geometrical correspondence between different frames from STS is utilized and combined with the monocular depth to yield final depth prediction. Comprehensive experiments on nuScenes show that STS greatly boosts 3D detection ability, notably for medium and long distance objects. On BEVDepth with ResNet-50 backbone, STS improves mAP and NDS by 2.6% and 1.4%, respectively. Consistent improvements are observed when using a larger backbone and a larger image resolution, demonstrating its effectiveness

[1]  Dahua Lin,et al.  Monocular 3D Object Detection with Depth from Motion , 2022, ECCV.

[2]  Zeming Li,et al.  BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection , 2022, AAAI.

[3]  Xi Li,et al.  MonoGround: Detecting Monocular 3D Objects from the Ground , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Sergey Zakharov,et al.  Multi-Frame Self-Supervised Depth with Transformers , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Jifeng Dai,et al.  BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers , 2022, ECCV.

[6]  Junjie Huang,et al.  BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection , 2022, ArXiv.

[7]  Jian Sun,et al.  PETR: Position Embedding Transformation for Multi-View 3D Object Detection , 2022, ECCV.

[8]  Junda Cheng,et al.  Attention Concatenation Volume for Accurate and Efficient Stereo Matching , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Zhenyu Wang,et al.  Rethinking Depth Estimation for Multi-View Stereo: A Unified Representation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  R. Cipolla,et al.  Multi-View Depth Estimation by Fusing Single-View Depth Probability with Multi-View Geometry , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Dalong Du,et al.  BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View , 2021, ArXiv.

[12]  Yilun Wang,et al.  DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries , 2021, CoRL.

[13]  Guoping Wang,et al.  AA-RMVSNet: Adaptive Aggregation Recurrent Multi-view Stereo Network , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Oisin Mac Aodha,et al.  The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Xinge Zhu,et al.  FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[16]  Jason J. Corso,et al.  Depth from Camera Motion and Object Detection , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Silvano Galliani,et al.  PatchmatchNet: Learned Multi-View Patchmatch Stereo , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Peter Wonka,et al.  AdaBins: Depth Estimation Using Adaptive Bins , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Daniel Cremers,et al.  MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Sanja Fidler,et al.  Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D , 2020, ECCV.

[21]  Juyong Zhang,et al.  AANet: Adaptive Aggregation Network for Efficient Stereo Matching , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Qingshan Xu,et al.  Learning Inverse Depth Regression for Multi-View Stereo with Correlation Cost Volume , 2019, AAAI.

[23]  Siyu Zhu,et al.  Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Tom van Dijk,et al.  How Do Neural Networks See Depth in Single Images? , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Ruigang Yang,et al.  GA-Net: Guided Aggregation Net for End-To-End Stereo Matching , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Xiaogang Wang,et al.  Group-Wise Correlation Stereo Network , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Long Quan,et al.  Recurrent MVSNet for High-Resolution Multi-View Stereo Depth Inference , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Dacheng Tao,et al.  Deep Ordinal Regression Network for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Long Quan,et al.  MVSNet: Depth Inference for Unstructured Multi-view Stereo , 2018, ECCV.

[31]  Yong-Sheng Chen,et al.  Pyramid Stereo Matching Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Alex Kendall,et al.  End-to-End Learning of Geometry and Context for Deep Stereo Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Xuming He,et al.  Indoor scene structure analysis for single image depth estimation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Xuming He,et al.  Discrete-Continuous Depth Estimation from a Single Image , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  A. Ng,et al.  Make3D: Learning 3D Scene Structure from a Single Still Image , 2022 .