Localization and Mapping using Instance-specific Mesh Models

This paper focuses on building semantic maps, containing object poses and shapes, using a monocular camera. This is an important problem because robots need a rich understanding of geometry and context if they are to shape the future of transportation, construction, and agriculture. Our contribution is an instance-specific mesh model of object shape that can be optimized online based on semantic information extracted from camera images. Multi-view constraints on the object shape are obtained by detecting objects and extracting category-specific keypoints and segmentation masks. We show that the errors between projections of the mesh model and the observed keypoints and masks can be differentiated in order to obtain accurate instance-specific object shapes. We evaluate the performance of the proposed approach in simulation and on the KITTI dataset by building maps of car poses and shapes.
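
To make the idea of differentiating projection errors concrete, the following is a minimal sketch, not the paper's implementation. It covers only the keypoint term (the mask term is omitted) and assumes a pinhole camera with made-up intrinsics, known camera poses, a toy four-keypoint shape, and plain gradient descent on the keypoint vertices. All names (`to_camera`, `project`, `loss_and_grad`, `rot_y`) and all numeric values are hypothetical and chosen for illustration.

```python
import math

# Assumed pinhole intrinsics (illustrative values, not taken from the paper).
F, CX, CY = 500.0, 320.0, 240.0

def to_camera(v, R, t):
    """Map a world-frame point v into the camera frame: p = R v + t."""
    return [sum(R[i][j] * v[j] for j in range(3)) + t[i] for i in range(3)]

def project(p):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    return [F * p[0] / p[2] + CX, F * p[1] / p[2] + CY]

def loss_and_grad(vertices, views):
    """Half the summed squared keypoint reprojection error over all views,
    and its gradient with respect to every keypoint vertex.
    `views` is a list of (R, t, observed_pixels) tuples."""
    loss = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in vertices]
    for R, t, observed in views:
        for k, v in enumerate(vertices):
            x, y, z = to_camera(v, R, t)
            u, w = project([x, y, z])
            ru, rw = u - observed[k][0], w - observed[k][1]
            loss += 0.5 * (ru * ru + rw * rw)
            # Gradient w.r.t. the camera-frame point (chain rule through x/z, y/z).
            g_cam = [F * ru / z, F * rw / z, -F * (ru * x + rw * y) / (z * z)]
            # Chain rule through p = R v + t gives dL/dv = R^T dL/dp.
            for j in range(3):
                grad[k][j] += sum(R[i][j] * g_cam[i] for i in range(3))
    return loss, grad

def rot_y(angle):
    """Rotation matrix about the y-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

# Toy instance: four keypoints observed from two known camera poses.
true_shape = [[0.5, 0.2, 0.4], [-0.5, 0.2, 0.4],
              [0.5, -0.2, -0.4], [-0.5, -0.2, -0.4]]
poses = [(rot_y(0.0), [0.0, 0.0, 5.0]),
         (rot_y(math.radians(30)), [0.0, 0.0, 5.0])]
views = [(R, t, [project(to_camera(v, R, t)) for v in true_shape])
         for R, t in poses]

# Start from a rough shape guess and descend the reprojection loss.
shape = [[0.4, 0.3, 0.3], [-0.6, 0.1, 0.5],
         [0.6, -0.1, -0.3], [-0.4, -0.3, -0.5]]
loss_start, _ = loss_and_grad(shape, views)
step = 1e-5  # small step chosen for these pixel-scale residuals
for _ in range(200):
    _, g = loss_and_grad(shape, views)
    shape = [[v[j] - step * gk[j] for j in range(3)] for v, gk in zip(shape, g)]
loss_end, _ = loss_and_grad(shape, views)
```

The second view is what constrains depth in this sketch: from a single view, moving a vertex along its viewing ray leaves its projection unchanged, so the multi-view constraints mentioned in the abstract are what make the shape recoverable.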
