Object Pose Estimation from Monocular Image using Multi-View Keypoint Correspondence

Understanding the geometry and pose of objects in 2D images is a fundamental necessity for a wide range of real world applications. Driven by deep neural networks, recent methods have brought significant improvements to object pose estimation. However, they suffer due to scarcity of keypoint/pose-annotated real images and hence can not exploit the object’s 3D structural information effectively. In this work, we propose a data-efficient method which utilizes the geometric regularity of intraclass objects for pose estimation. First, we learn pose-invariant local descriptors of object parts from simple 2D RGB images. These descriptors, along with keypoints obtained from renders of a fixed 3D template model are then used to generate keypoint correspondence maps for a given monocular real image. Finally, a pose estimation network predicts 3D pose of the object using these correspondence maps. This pipeline is further extended to a multi-view approach, which assimilates keypoint information from correspondence sets generated from multiple views of the 3D template model. Fusion of multi-view information significantly improves geometric comprehension of the system which in turn enhances the pose estimation performance. Furthermore, use of correspondence framework responsible for the learning of pose invariant keypoint descriptor also allows us to effectively alleviate the data-scarcity problem. This enables our method to achieve state-of-the-art performance on multiple real-image viewpoint estimation datasets, such as Pascal3D+ and ObjectNet3D. To encourage reproducible research, we have released the codes for our proposed approach (Code: https://github.com/val-iisc/pose_estimation).

[1]  Leonidas J. Guibas,et al.  Volumetric and Multi-view CNNs for Object Classification on 3D Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[3]  Song Bai,et al.  Triplet-Center Loss for Multi-view 3D Object Retrieval , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[5]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Dieter Fox,et al.  Self-Supervised Visual Descriptor Learning for Dense Correspondence , 2017, IEEE Robotics and Automation Letters.

[7]  Ersin Yumer,et al.  Learning Local Shape Descriptors from Part Correspondences with Multiview Convolutional Networks , 2017, ACM Trans. Graph..

[8]  Jitendra Malik,et al.  Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Jitendra Malik,et al.  Viewpoints and keypoints , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Jitendra Malik,et al.  Shape matching and object recognition using low distortion correspondences , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11]  René Vidal,et al.  3D Pose Regression Using Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[12]  Vincent Lepetit,et al.  3D Pose Estimation and 3D Model Retrieval for Objects in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Alexei A. Efros,et al.  Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Jana Kosecka,et al.  Fast Single Shot Detection and Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[15]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Silvio Savarese,et al.  Universal Correspondence Network , 2016, NIPS.

[17]  Pascal Fua,et al.  Learning Monocular 3D Human Pose Estimation from Multi-view Images , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Hongxun Yao,et al.  Hierarchical semantic image matching using CNN feature pyramid , 2018, Comput. Vis. Image Underst..

[19]  Yuandong Tian,et al.  Single Image 3D Interpreter Network , 2016, ECCV.

[20]  Yoichi Sato,et al.  Joint Recovery of Dense Correspondence and Cosegmentation in Two Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Lucas Paletta,et al.  Appearance-based active object recognition , 2000, Image Vis. Comput..

[23]  Leonidas J. Guibas,et al.  ObjectNet3D: A Large Scale Database for 3D Object Recognition , 2016, ECCV.

[24]  Yasuyuki Matsushita,et al.  RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[26]  R. Venkatesh Babu,et al.  AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[28]  Jean Ponce,et al.  SCNet: Learning Semantic Correspondence , 2017, ICCV.

[29]  Antonio Torralba,et al.  SIFT Flow: Dense Correspondence across Scenes and Its Applications , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Lucas Paletta,et al.  Active object recognition by view integration and reinforcement learning , 2000, Robotics Auton. Syst..

[32]  Subhransu Maji,et al.  Multi-view Convolutional Neural Networks for 3D Shape Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .