Mask R-CNN Based Semantic RGB-D SLAM for Dynamic Scenes

Traditional visual SLAM algorithms run robustly under the assumption of a static environment but often fail in dynamic scenarios, since moving objects impair camera pose tracking. This study proposes a novel semantic SLAM framework for RGB-D cameras that detects potentially moving elements with Mask R-CNN to achieve robustness in dynamic scenes. In the framework, semantic instance segmentation runs as an independent thread in parallel with the three other threads: tracking, local mapping, and loop closing. Whereas most methods use only multi-view geometry to decide whether segmented regions are moving, the proposed method simultaneously estimates the camera motion and the probability that each part of the scene is dynamic or static. Experiments on the TUM RGB-D datasets compare the proposed method with state-of-the-art approaches. Results demonstrate that the proposed method improves the accuracy of the absolute trajectory in dynamic scenes.
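The abstract's core idea, jointly estimating camera motion and per-feature static/dynamic probabilities rather than deciding motion from geometry alone, can be illustrated with an alternating reweighting scheme. The sketch below is not the paper's algorithm; it is a minimal 2-D toy (weighted rigid alignment of point correspondences, in place of full 6-DoF RGB-D pose estimation) in which the function names, the Gaussian reweighting rule, and the segmentation-derived prior are all illustrative assumptions:

```python
import numpy as np

def estimate_rigid_2d(P, Q, w):
    """Weighted 2-D rigid transform so that Q ~ P @ R.T + t (weighted Kabsch)."""
    w = w / w.sum()
    mp = (w[:, None] * P).sum(axis=0)          # weighted centroids
    mq = (w[:, None] * Q).sum(axis=0)
    H = (P - mp).T @ (np.diag(w) @ (Q - mq))   # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mq - R @ mp
    return R, t

def joint_pose_and_static_weights(P, Q, prior, iters=5, sigma=0.5):
    """Alternate pose estimation with updating each point's static weight.

    `prior` stands in for a per-point static prior (e.g. lower for points on
    Mask R-CNN instances of movable classes); high-residual points are
    down-weighted so the pose is dominated by the static background.
    """
    w = prior.copy()
    for _ in range(iters):
        R, t = estimate_rigid_2d(P, Q, w)
        r = np.linalg.norm(Q - (P @ R.T + t), axis=1)   # reprojection-style residual
        w = prior * np.exp(-(r / sigma) ** 2)           # soft static probability
    return R, t, w
```

After a few iterations, points that move inconsistently with the dominant rigid motion end up with near-zero weight, so they stop influencing the pose estimate; this mirrors the soft dynamic/static classification described in the abstract without hard-thresholding the segmentation output.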
