Indoor Relocalization in Challenging Environments With Dual-Stream Convolutional Neural Networks

This paper presents an indoor relocalization system using a dual-stream convolutional neural network (CNN) that takes both color images and depth images as network inputs. Aiming at the pose regression problem, a deep neural network architecture for RGB-D images is introduced, a stage-wise training method for the dual-stream CNN is presented, different depth image encoding methods are discussed, and a novel encoding method is proposed. By introducing range information into the network through a dual-stream architecture, we not only improve relocalization accuracy by about 20% compared with the state-of-the-art deep learning method for pose regression, but also greatly enhance system robustness in challenging scenes such as large-scale, dynamic, fast-movement, and night-time environments. To the best of our knowledge, this is the first work to address indoor relocalization using deep CNNs with an RGB-D camera. The method is first evaluated on the Microsoft 7-Scenes data set to show its accuracy advantage over other CNNs. Large-scale indoor relocalization is then demonstrated with our method; the experimental results show that an accuracy of 0.3 m in position and 4° in orientation can be obtained. Finally, the method is evaluated on challenging indoor data sets collected with a motion capture system. The results show that relocalization performance is hardly affected by dynamic objects, motion blur, or night-time environments.

Note to Practitioners—This paper was motivated by the limitations of existing indoor relocalization technology, which is significant for mobile robot navigation. Using this technology, robots can infer where they are in a previously visited place. Previous visual localization methods can hardly be put into wide application because they impose strict requirements on the environment.
When faced with challenging scenes such as large-scale environments, dynamic objects, motion blur caused by fast movement, night-time environments, or other appearance changes, most existing methods tend to fail. This paper introduces deep learning into the indoor relocalization problem and uses a dual-stream CNN (a depth stream and a color stream) to realize 6-DOF pose regression in an end-to-end manner. The localization error is about 0.3 m and 4° in large-scale indoor environments. More importantly, the proposed system remains effective in challenging scenes. The proposed depth image encoding method can also be adopted in other deep neural networks that use RGB-D cameras as the sensor.
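Since the abstract does not spell out the proposed depth encoding, the following is only a minimal baseline sketch of the general idea: converting a single-channel metric depth map into a 3-channel 8-bit image so it can feed a standard color-input CNN stream. The function name and the sensor range limits (`d_min`, `d_max`) are illustrative assumptions, not the paper's method.

```python
import numpy as np

def encode_depth_for_cnn(depth_m, d_min=0.3, d_max=5.0):
    """Encode a metric depth image (meters) as a 3-channel 8-bit image.

    A generic normalize-and-replicate scheme (NOT the paper's proposed
    encoding): clip to the assumed sensor range, scale to [0, 255],
    and replicate to three channels for a color-input CNN stream.
    """
    depth = np.clip(depth_m, d_min, d_max)            # limit to sensor range
    norm = (depth - d_min) / (d_max - d_min)          # -> [0, 1]
    gray = (norm * 255).astype(np.uint8)              # 8-bit single channel
    return np.stack([gray, gray, gray], axis=-1)      # H x W x 3

# Toy 2x2 depth map in meters; 10.0 m is clipped to d_max.
d = np.array([[0.3, 5.0], [2.65, 10.0]])
img = encode_depth_for_cnn(d)
```

More elaborate encodings used in the RGB-D literature (e.g., colorizing depth with a colormap, or surface-normal-based channels) follow the same pattern of mapping depth into a CNN-compatible 3-channel image.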
