MSDPN: Monocular Depth Prediction with Partial Laser Observation using Multi-stage Neural Networks

In this study, we propose a deep-learning-based multi-stage network architecture, the Multi-Stage Depth Prediction Network (MSDPN), to predict a dense depth map from a 2D LiDAR and a monocular camera. The proposed network combines a multi-stage encoder-decoder architecture with Cross Stage Feature Aggregation (CSFA). The multi-stage encoder-decoder architecture alleviates the partial observation problem caused by the single scan plane of a 2D LiDAR, while CSFA prevents features from being diluted as they pass through the stages and helps the network learn the inter-spatial relationships between features. Previous works use data sub-sampled from the ground truth as input rather than actual 2D LiDAR measurements; in contrast, we train and evaluate our model on physically collected 2D LiDAR data. To this end, we acquired our own dataset, the KAIST RGBD-scan dataset, and validated the effectiveness and robustness of MSDPN under realistic conditions. As verified experimentally, our network performs favorably against state-of-the-art methods. Additionally, we compared different input methods and confirmed that the reference depth map is robust in untrained scenarios.
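To make the architecture described above concrete, the following is a minimal PyTorch sketch of a two-stage encoder-decoder cascade with cross-stage feature aggregation. The stage count, layer widths, input resolution, and the additive fusion rule are illustrative assumptions for exposition, not the actual MSDPN configuration, which the abstract does not specify.

```python
# Minimal sketch of a two-stage encoder-decoder with cross-stage
# feature aggregation (CSFA-style), in the spirit of MSDPN.
# All layer sizes, the stage count, and the additive fusion rule
# are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class EncoderDecoder(nn.Module):
    """One hourglass stage: downsample twice, then upsample back."""
    def __init__(self, in_ch, feat_ch):
        super().__init__()
        self.enc1 = conv_block(in_ch, feat_ch, stride=2)    # 1/2 res
        self.enc2 = conv_block(feat_ch, feat_ch, stride=2)  # 1/4 res
        self.dec2 = conv_block(feat_ch, feat_ch)
        self.dec1 = conv_block(feat_ch, feat_ch)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    def forward(self, x, skip=None):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        # Cross-stage aggregation: re-inject encoder features kept
        # from the previous stage so they are not diluted along the
        # cascade (additive fusion assumed here).
        if skip is not None:
            e1 = e1 + skip["e1"]
            e2 = e2 + skip["e2"]
        d2 = self.up(self.dec2(e2)) + e1   # intra-stage skip
        d1 = self.up(self.dec1(d2))
        return d1, {"e1": e1, "e2": e2}

class TwoStageDepthNet(nn.Module):
    """RGB (3 ch) + sparse 2D-LiDAR depth (1 ch) -> dense depth."""
    def __init__(self, feat_ch=32):
        super().__init__()
        self.stage1 = EncoderDecoder(4, feat_ch)
        self.stage2 = EncoderDecoder(feat_ch, feat_ch)
        self.head = nn.Conv2d(feat_ch, 1, 1)

    def forward(self, rgb, sparse_depth):
        x = torch.cat([rgb, sparse_depth], dim=1)
        f1, feats = self.stage1(x)
        f2, _ = self.stage2(f1, skip=feats)
        return self.head(f2)

if __name__ == "__main__":
    net = TwoStageDepthNet()
    rgb = torch.randn(1, 3, 64, 128)
    scan = torch.randn(1, 1, 64, 128)   # 2D scan projected to image
    print(net(rgb, scan).shape)          # torch.Size([1, 1, 64, 128])
```

The key design point this sketch illustrates is that encoder features from the earlier stage are fused into the matching resolutions of the later stage, so information recovered from the sparse LiDAR observation is preserved rather than diluted as it propagates through the cascade.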
