A Unified Framework for Multi-View Multi-Class Object Pose Estimation

One core challenge in object pose estimation is to ensure accurate and robust performance for large numbers of diverse foreground objects amidst complex background clutter. In this work, we present a scalable framework for accurately inferring six Degree-of-Freedom (6-DoF) pose for a large number of object classes from single or multiple views. To learn discriminative pose features, we integrate three new capabilities into a deep Convolutional Neural Network (CNN): an inference scheme that combines both classification and pose regression based on a uniform tessellation of the Special Euclidean group in three dimensions (SE(3)), the fusion of class priors into the training process via a tiled class map, and an additional regularization using deep supervision with an object mask. Further, an efficient multi-view framework is formulated to address single-view ambiguity. We show that this framework consistently improves the performance of the single-view network. We evaluate our method on three large-scale benchmarks: YCB-Video, JHUScene-50 and ObjectNet-3D. Our approach achieves competitive or superior performance over the current state-of-the-art methods.

[1]  Ulrich Schwanecke,et al.  Real-Time Monocular Pose Estimation of 3D Objects Using Temporally Consistent Local Color Histograms , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[2]  Gregory D. Hager,et al.  Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Subhransu Maji,et al.  Multi-view Convolutional Neural Networks for 3D Shape Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Gregory S. Chirikjian,et al.  Pose Changes From a Different Point of View , 2017 .

[5]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[6]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Leonidas J. Guibas,et al.  ObjectNet3D: A Large Scale Database for 3D Object Recognition , 2016, ECCV.

[9]  Eric Brachmann,et al.  Global Hypothesis Generation for 6D Object Pose Estimation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Gregory S. Chirikjian,et al.  Almost-uniform sampling of rotations for conformational searches in Robotics and Structural Biology , 2012, 2012 IEEE International Conference on Robotics and Automation.

[11]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[12]  Gregory D. Hager,et al.  Hierarchical semantic parsing for object pose estimation in densely cluttered scenes , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[13]  V. Lepetit,et al.  EPnP: An Accurate O(n) Solution to the PnP Problem , 2009, International Journal of Computer Vision.

[14]  Nassir Navab,et al.  Incremental scene understanding on dense SLAM , 2016, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[15]  Paul H. J. Kelly,et al.  SLAM++: Simultaneous Localisation and Mapping at the Level of Objects , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Tae-Kyun Kim,et al.  Recovering 6D Object Pose and Predicting Next-Best-View in the Crowd , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[18]  Federico Tombari,et al.  CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yi Zhang,et al.  UnrealCV: Virtual Worlds for Computer Vision , 2017, ACM Multimedia.

[20]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[22]  Jana Kosecka,et al.  3D Bounding Box Estimation Using Deep Learning and Geometry , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Mathieu Aubry,et al.  Crafting a multi-task CNN for viewpoint estimation , 2016, BMVC.

[24]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[25]  Tae-Kyun Kim,et al.  Latent-Class Hough Forests for 3D Object Detection and Pose Estimation , 2014, ECCV.

[26]  Nassir Navab,et al.  SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Dieter Fox,et al.  PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes , 2017, Robotics: Science and Systems.

[28]  Tae-Kyun Kim,et al.  Pose Guided RGBD Feature Learning for 3D Object Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  John J. Leonard,et al.  Monocular SLAM Supported Object Recognition , 2015, Robotics: Science and Systems.

[30]  Vincent Lepetit,et al.  BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[31]  Eric Brachmann,et al.  Learning Analysis-by-Synthesis for 6D Pose Estimation in RGB-D Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[32]  Federico Tombari,et al.  A combined texture-shape descriptor for enhanced 3D feature matching , 2011, 2011 18th IEEE International Conference on Image Processing.

[33]  Eric Brachmann,et al.  Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Nassir Navab,et al.  Deep Learning of Local RGB-D Patches for 3D Object Detection and 6D Pose Estimation , 2016, ECCV.

[35]  Dieter Fox,et al.  Detection-based object labeling in 3D scenes , 2012, 2012 IEEE International Conference on Robotics and Automation.

[36]  Pascal Fua,et al.  Real-Time Seamless Single Shot 6D Object Pose Prediction , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Vincent Lepetit,et al.  Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes , 2012, ACCV.

[38]  Radu Bogdan Rusu,et al.  Semantic 3D Object Maps for Everyday Manipulation in Human Living Environments , 2010, KI - Künstliche Intelligenz.

[39]  Justus H. Piater,et al.  Integration of Probabilistic Pose Estimates from Multiple Views , 2016, ECCV.

[40]  Vincent Lepetit,et al.  Learning descriptors for object recognition and 3D pose estimation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Kuan-Ting Yu,et al.  Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[42]  Eric Brachmann,et al.  Learning 6 D Object Pose Estimation using 3 D Object Coordinates-Supplementary Material - , 2014 .

[43]  Stefan Leutenegger,et al.  Pairwise Decomposition of Image Sequences for Active Multi-view Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[45]  Sergey Levine,et al.  Learning Hand-Eye Coordination for Robotic Grasping with Large-Scale Data Collection , 2016, ISER.

[46]  Darius Burschka,et al.  An Efficient RANSAC for 3D Object Recognition in Noisy and Occluded Scenes , 2010, ACCV.

[47]  Andrew W. Fitzgibbon,et al.  KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera , 2011, UIST.

[48]  Eric Brachmann,et al.  Learning 6D Object Pose Estimation Using 3D Object Coordinates , 2014, ECCV.