Limitations of Metric Loss for the Estimation of Joint Translation and Rotation

Localizing objects is a key challenge for robotics, augmented reality, and mixed reality applications. Images taken in the real world contain many objects subject to challenging factors such as occlusion, motion blur, and changing illumination. In manufacturing scenes, a large majority of objects are poorly textured or highly reflective; moreover, they often exhibit symmetries, which makes the localization task even harder. PoseNet is a deep neural network based on GoogLeNet that predicts camera poses in indoor rooms and outdoor street scenes. We evaluate this method on the problem of industrial object pose estimation by training the network on the T-LESS dataset. Our experiments demonstrate that PoseNet is able to predict translation and rotation separately with high accuracy. However, they also show that it fails to learn translation and rotation jointly: one of the two modalities is either never learned by the network, or is forgotten during training while the other is being learned. This motivates future work on alternative loss formulations and network architectures to solve the general pose estimation problem.
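The metric loss under study is the PoseNet-style joint objective, which sums the Euclidean translation error and a β-weighted quaternion rotation error. A minimal NumPy sketch of this loss follows; the function name and the default value of β are illustrative choices, and in practice β is a scene-dependent hyperparameter balancing the two terms:

```python
import numpy as np

def joint_pose_loss(t_pred, t_true, q_pred, q_true, beta=500.0):
    """PoseNet-style metric loss: Euclidean distance on translation
    plus a beta-weighted Euclidean distance between quaternions.
    t_* are 3-vectors; q_* are 4-vectors (quaternions)."""
    t_err = np.linalg.norm(t_pred - t_true)
    # Normalize the ground-truth quaternion to unit length before comparing,
    # so the rotation term is independent of its stored scale.
    q_err = np.linalg.norm(q_pred - q_true / np.linalg.norm(q_true))
    return t_err + beta * q_err
```

Because β multiplies only the rotation term, a poorly chosen value lets one modality dominate the gradient, which is consistent with the failure mode described above: the network optimizes one modality at the expense of the other.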
