Hierarchical semantic parsing for object pose estimation in densely cluttered scenes

Densely cluttered scenes contain multiple objects that are in close contact and heavily occlude one another. Few existing 3D object recognition systems can accurately predict object poses in such scenarios, mainly because the objects have textureless surfaces and similar appearances and because object instance segmentation is difficult. In this paper, we present a hierarchical semantic segmentation algorithm that partitions a densely cluttered scene into different object regions. A RANSAC-based registration method is subsequently applied to estimate 6-DoF object poses within each object class. The algorithm includes a generalized pooling scheme that constructs robust and discriminative object representations from a convolutional architecture with multiple pooling domains. We also provide a new RGB-D dataset that serves as a benchmark for object pose estimation in densely cluttered scenes. This dataset contains five thousand scene frames and over twenty thousand labeled poses of ten common hand tools. On this new dataset, our method improves pose estimation performance over other state-of-the-art methods.
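To make the registration step concrete, the following is a minimal sketch of generic RANSAC-based rigid registration, not the specific method of this paper. It assumes that putative point correspondences between a model point cloud and a segmented scene region are already available (row i of `model` matches row i of `scene`), which is a simplification. The function names `kabsch` and `ransac_rigid` and the parameters `iters` and `thresh` are hypothetical and chosen only for illustration.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~ src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def ransac_rigid(model, scene, iters=200, thresh=0.01, seed=0):
    """Estimate a 6-DoF pose from noisy correspondences (assumed given).

    model, scene: (N, 3) arrays; row i of model is putatively matched
    to row i of scene. Returns (R, t, inlier_mask).
    """
    rng = np.random.default_rng(seed)
    n = len(model)
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(iters):
        # Three non-collinear correspondences determine a rigid transform.
        idx = rng.choice(n, size=3, replace=False)
        R, t = kabsch(model[idx], scene[idx])
        residuals = np.linalg.norm(model @ R.T + t - scene, axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("no consensus set found")
    # Refit on the full consensus set for the final pose.
    R, t = kabsch(model[best_inliers], scene[best_inliers])
    return R, t, best_inliers
```

In a pipeline like the one described above, such a routine would presumably be run once per segmented object class, with `thresh` set relative to the sensor noise; those choices are assumptions of this sketch rather than details stated in the abstract.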
