M4Depth: A motion-based approach for monocular depth estimation on video sequences

Estimating the distance to objects is crucial for autonomous vehicles. When depth sensors cannot be used, this distance has to be estimated from RGB cameras. Unlike cars, drones impose few constraints on camera motion during flight, which makes depth estimation from on-board cameras significantly more complex. In this paper, we present a method to estimate the distance of objects seen by an on-board camera by using its RGB video stream together with the motion information of the drone. Our method is built upon a pyramidal convolutional neural network architecture and uses time recurrence in tandem with the geometric constraints imposed by motion to produce pixel-wise depth maps. In our architecture, each level of the pyramid is designed to produce its own depth estimate based on past observations and on the information provided by the previous level of the pyramid. We introduce a spatial reprojection layer to maintain the spatio-temporal consistency of the data between levels. We analyse the performance of our approach on Mid-Air, a public drone dataset featuring synthetic drone trajectories recorded in a wide variety of unstructured outdoor environments. Our experiments show that our network outperforms state-of-the-art depth estimation methods, and that the use of motion information is the main contributing factor for this improvement. The code of our method is publicly available at https://github.com/michael-fonder/M4Depth
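The abstract names two mechanisms: per-level depth refinement in a pyramid, and a spatial reprojection layer that warps past estimates using camera motion. The sketch below illustrates these two ideas only; it is a minimal, hypothetical example assuming PyTorch, and the names `reproject` and `PyramidLevel`, the flow-field input, and all layer sizes are illustrative assumptions, not the authors' implementation (which is available in the repository linked above).

```python
# Minimal, hypothetical sketch of the ideas summarized in the abstract.
# NOT the authors' implementation: `reproject`, `PyramidLevel`, the
# flow-field input, and all layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def reproject(prev_depth, flow):
    """Warp the previous time step's depth map with a motion-derived
    displacement field, so past and current estimates are compared in
    the same image frame (stand-in for a spatial reprojection layer)."""
    b, _, h, w = prev_depth.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=prev_depth.device),
        torch.linspace(-1, 1, w, device=prev_depth.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    grid = base + flow.permute(0, 2, 3, 1)  # displace the sampling grid
    return F.grid_sample(prev_depth, grid, align_corners=True)

class PyramidLevel(nn.Module):
    """One pyramid level: fuses current-frame features, the estimate
    handed down by the coarser level, and the motion-warped estimate
    from the previous time step into a refined depth map."""
    def __init__(self, feat_ch):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch + 2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Softplus(),  # keep depth > 0
        )

    def forward(self, feats, coarser_depth, warped_past_depth):
        x = torch.cat((feats, coarser_depth, warped_past_depth), dim=1)
        return self.refine(x)

if __name__ == "__main__":
    level = PyramidLevel(feat_ch=16)
    feats = torch.randn(1, 16, 32, 32)       # current-frame features
    coarse = torch.rand(1, 1, 32, 32)        # upsampled coarser-level depth
    past = torch.rand(1, 1, 32, 32)          # estimate from previous frame
    flow = 0.05 * torch.randn(1, 2, 32, 32)  # motion-derived displacements
    depth = level(feats, coarse, reproject(past, flow))
    print(depth.shape)  # torch.Size([1, 1, 32, 32])
```

In a full coarse-to-fine pipeline, several such levels would be stacked, each one upsampling its output and handing it to the next finer level, matching the abstract's description of levels building on the previous level's estimate.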
