LiDARTouch: Monocular metric depth estimation with a few-beam LiDAR

Vision-based depth estimation is a key capability for autonomous systems, which often rely on a single camera or several independent ones. In such a monocular setup, dense depth is obtained either with additional input from one or several expensive LiDARs, e.g., with 64 beams, or with camera-only methods, which suffer from scale-ambiguity and infinite-depth problems. In this paper, we propose a new alternative: densely estimating metric depth by combining a monocular camera with a lightweight LiDAR, e.g., with 4 beams, typical of today’s automotive-grade mass-produced laser scanners. Inspired by recent self-supervised methods, we introduce a novel framework, called LiDARTouch, to estimate dense depth maps from monocular images with the help of “touches” of LiDAR, i.e., without the need for dense ground-truth depth. In our setup, the minimal LiDAR input contributes on three different levels: as an additional input to the model, in a self-supervised LiDAR reconstruction objective, and in the estimation of pose changes (a key component of self-supervised depth estimation architectures). Our LiDARTouch framework achieves a new state of the art in self-supervised depth estimation on the KITTI dataset, supporting our choice of integrating the very sparse LiDAR signal with other visual features. Moreover, we show that the use of a few-beam LiDAR alleviates the scale-ambiguity and infinite-depth issues that camera-only methods suffer from. We also demonstrate that methods from the fully supervised depth-completion literature can be adapted to a self-supervised regime with a minimal LiDAR signal.
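The abstract's second contribution level, the self-supervised LiDAR reconstruction objective, can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function names, the plain L1 photometric term, and the weight `w_lidar` are illustrative assumptions. The key idea is that the sparse LiDAR term is evaluated only at pixels where a few-beam return exists, which is what anchors the otherwise scale-ambiguous photometric loss to metric scale:

```python
import numpy as np

def lidar_reconstruction_loss(pred_depth, lidar_depth):
    """L1 error between predicted depth and the sparse few-beam LiDAR
    map, evaluated only where a LiDAR return exists (depth > 0)."""
    mask = lidar_depth > 0
    if not mask.any():
        return 0.0
    return float(np.abs(pred_depth[mask] - lidar_depth[mask]).mean())

def photometric_l1(target, warped):
    """Per-pixel L1 photometric error between the target frame and a
    neighboring frame warped into the target view via depth and pose."""
    return float(np.abs(target - warped).mean())

def total_loss(target, warped, pred_depth, lidar_depth, w_lidar=0.5):
    """Combined self-supervised objective: image reconstruction plus a
    sparse LiDAR term; only the latter carries metric-scale information."""
    return photometric_l1(target, warped) + \
        w_lidar * lidar_reconstruction_loss(pred_depth, lidar_depth)
```

Note how few LiDAR pixels are needed: with a 4-beam scanner only a small fraction of the image receives a return, yet the masked L1 term still fixes the global scale of the depth map.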
