Semantic translation with convolutional encoder-decoder networks for viewpoint estimation

Viewpoint estimation is an essential step in vision-based robotic manipulation. Textureless objects offer few feature points, which hinders the generalization of classical methods. To address this, we propose a new viewpoint estimation pipeline that uses semantic translation to highlight the structures of interest (SOIs) as foreground. In our method, a convolutional encoder-decoder network serves as the generator for semantic segmentation, and we explore an adversarial training strategy, with a conditional adversarial network as the discriminator, to obtain finer details. We also contribute a dataset for the experiments and perform viewpoint estimation based on the semantic outputs. Furthermore, we deploy our pipeline on a robotic eye-in-hand system to complete a viewpoint transfer task. The experimental results show that the proposed method (1) extracts features from textureless objects, (2) improves semantic translation through adversarial training, and (3) is applicable to real robotic manipulation tasks.
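
The adversarial training strategy described above can be illustrated with a minimal sketch of the two loss terms. This is not the authors' implementation: the function names, the per-pixel cross-entropy segmentation term, and the weight `seg_weight` are assumptions chosen for illustration, and the discriminator outputs are taken to be probabilities for (image, mask) pairs.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def discriminator_loss(d_real, d_fake):
    """Loss for a conditional discriminator (hypothetical helper).

    d_real: D(image, ground-truth mask), probabilities in [0, 1]
    d_fake: D(image, generated mask), probabilities in [0, 1]
    """
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake, fake_mask, true_mask, seg_weight=10.0):
    """Loss for the encoder-decoder generator (hypothetical helper).

    The adversarial term rewards masks the discriminator accepts as real;
    the segmentation term (an assumption here) keeps the mask close to the
    ground truth. seg_weight is an illustrative value, not from the paper.
    """
    adversarial = bce(d_fake, np.ones_like(d_fake))
    segmentation = bce(fake_mask, true_mask)
    return adversarial + seg_weight * segmentation
```

In such a setup the two losses would be minimized in alternation, one step for the discriminator and one for the generator; the relative weight of the segmentation term controls how strongly the generator is tied to the labeled masks.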
