Unsupervised Depth Completion From Visual Inertial Odometry

We describe a method to infer dense depth from camera motion and sparse depth as estimated by a visual-inertial odometry system. Unlike scenarios that use point clouds from lidar or structured-light sensors, we have only a few hundred to a few thousand points, insufficient to inform the topology of the scene. Our method first constructs a piecewise planar scaffolding of the scene and then uses it, along with the image and the sparse points, to infer dense depth. We use a predictive cross-modal criterion, akin to "self-supervision," measuring photometric consistency across time, forward-backward pose consistency, and geometric compatibility with the sparse point cloud. We also present the first visual-inertial + depth dataset, which we hope will foster additional exploration into combining the complementary strengths of visual and inertial sensors. To compare against prior work, we adopt the unsupervised KITTI depth completion benchmark, on which we achieve state-of-the-art performance.
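
The abstract does not spell out how the piecewise planar scaffolding is built; one plausible construction, offered here only as a sketch, interpolates the sparse depths over a 2-D Delaunay triangulation of their image coordinates, so that each triangle contributes one planar facet. The Python below assumes NumPy and SciPy; planar_scaffolding and the synthetic inputs are hypothetical illustrations, not the paper's implementation.

    # Hypothetical sketch: densify sparse VIO depth into a piecewise planar
    # scaffolding via linear interpolation over a Delaunay triangulation.
    import numpy as np
    from scipy.interpolate import LinearNDInterpolator

    def planar_scaffolding(points_uv, depths, height, width):
        """Interpolate sparse depths into a dense map; each triangle of the
        interpolator's Delaunay triangulation yields one planar facet."""
        interp = LinearNDInterpolator(points_uv, depths)  # Qhull-based
        u, v = np.meshgrid(np.arange(width), np.arange(height))
        scaffold = interp(np.stack([u.ravel(), v.ravel()], axis=1))
        return scaffold.reshape(height, width)  # NaN outside the convex hull

    # Example with ~500 synthetic points, the regime the abstract describes.
    rng = np.random.default_rng(0)
    uv = rng.uniform([0.0, 0.0], [640.0, 480.0], size=(500, 2))
    z = rng.uniform(1.0, 50.0, size=500)
    scaffold = planar_scaffolding(uv, z, height=480, width=640)

In the full method, such a coarse scaffold would serve only as an initialization: the network refines it using the image, driven by the photometric, forward-backward pose, and sparse-depth consistency terms described above.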
