3D Scene Mesh from CNN Depth Predictions and Sparse Monocular SLAM

In this paper, we propose a novel framework for integrating geometrical measurements of monocular visual simultaneous localization and mapping (SLAM) and depth prediction using a convolutional neural network (CNN). In our framework, SLAM-measured sparse features and CNN-predicted dense depth maps are fused to obtain a more accurate dense 3D reconstruction including scale. We continuously update an initial 3D mesh by integrating accurately tracked sparse features points. Compared to prior work on integrating SLAM and CNN estimates [26], there are two main differences: Using a 3D mesh representation allows as-rigid-as-possible update transformations. We further propose a system architecture suitable for mobile devices, where feature tracking and CNN-based depth prediction modules are separated, and only the former is run on the device. We evaluate the framework by comparing the 3D reconstruction result with 3D measurements obtained using an RGBD sensor, showing a reduction in the mean residual error of 38% compared to CNN-based depth map prediction alone.

[1]  Alan L. Yuille,et al.  Towards unified depth and semantic prediction from a single image , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  David W. Murray,et al.  Mobile Robot Localisation Using Active Vision , 1998, ECCV.

[3]  Horst Bischof,et al.  Efficient 3D scene abstraction using line segments , 2017, Comput. Vis. Image Underst..

[4]  Andrew J. Davison,et al.  DTAM: Dense tracking and mapping in real-time , 2011, 2011 International Conference on Computer Vision.

[5]  Stefan Leutenegger,et al.  ElasticFusion: Real-time dense SLAM and light source estimation , 2016, Int. J. Robotics Res..

[6]  G. Klein,et al.  Parallel Tracking and Mapping for Small AR Workspaces , 2007, 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality.

[7]  Jianxiong Xiao,et al.  3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Olivier Stasse,et al.  MonoSLAM: Real-Time Single Camera SLAM , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[10]  Daniel Cremers,et al.  LSD-SLAM: Large-Scale Direct Monocular SLAM , 2014, ECCV.

[11]  Roland Siegwart,et al.  A robust and modular multi-sensor fusion approach applied to MAV navigation , 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[12]  Marie Schmidt,et al.  Matchmoving The Invisible Art Of Camera Tracking , 2016 .

[13]  Marc Alexa,et al.  As-rigid-as-possible surface modeling , 2007, Symposium on Geometry Processing.

[14]  Tim Weyrich,et al.  Real-Time 3D Reconstruction in Dynamic Scenes Using Point-Based Fusion , 2013, 2013 International Conference on 3D Vision.

[15]  Daniel Cremers,et al.  Semi-dense Visual Odometry for a Monocular Camera , 2013, 2013 IEEE International Conference on Computer Vision.

[16]  Guosheng Lin,et al.  Deep convolutional neural fields for depth estimation from a single image , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  J. M. M. Montiel,et al.  ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[18]  Chunhua Shen,et al.  Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Richard Szeliski,et al.  Building Rome in a day , 2009, ICCV.

[20]  Craig Gotsman,et al.  Smooth Rotation Enhanced As-Rigid-As-Possible Mesh Animation , 2015, IEEE Transactions on Visualization and Computer Graphics.

[21]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[22]  Marc Pollefeys,et al.  Turning Mobile Phones into 3D Scanners , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[24]  Andrew W. Fitzgibbon,et al.  KinectFusion: Real-time dense surface mapping and tracking , 2011, 2011 10th IEEE International Symposium on Mixed and Augmented Reality.

[25]  Olga Sorkine-Hornung,et al.  On Linear Variational Surface Deformation Methods , 2008, IEEE Transactions on Visualization and Computer Graphics.

[26]  Federico Tombari,et al.  CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Marc Levoy,et al.  A volumetric method for building complex models from range images , 1996, SIGGRAPH.