MORE: Simultaneous Multi-View 3D Object Recognition and Pose Estimation

Object recognition and pose estimation are two key functionalities that robots need to interact safely with humans and their environments. Although both tasks use visual input, most state-of-the-art methods tackle them as two separate problems, since object recognition needs a view-invariant representation while pose estimation requires a view-dependent description. Multi-view Convolutional Neural Network (MVCNN) approaches currently show state-of-the-art classification performance. Although MVCNN object recognition has been widely explored, there has been very little research on multi-view object pose estimation, and even less on addressing the two problems simultaneously. Moreover, the poses of the virtual cameras in MVCNN methods are often pre-defined, which limits the applicability of such approaches. In this paper, we propose an approach capable of handling object recognition and pose estimation simultaneously. In particular, we develop a deep object-agnostic entropy estimation model capable of predicting the best viewpoints of a given 3D object. The views of the object obtained from these viewpoints are then fed to the network to simultaneously predict the pose and category label of the target object. Experimental results show that the views obtained from such positions are descriptive enough to achieve a good accuracy score. Code is available online at: https://github.com/tparisotto/more_mvcnn
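
To make the idea of entropy-driven view selection concrete, the following is a minimal sketch of the classical viewpoint entropy measure from the viewpoint-selection literature: a view is scored by the Shannon entropy of the relative projected areas of the faces it shows, and the highest-scoring views are kept. It is an illustration under stated assumptions and not the model proposed here, which learns to predict entropy with a deep network. The function names, the view names, and the area values are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def view_entropy(visible_areas: Sequence[float]) -> float:
    """Shannon entropy (in bits) of the projected-area distribution of one view.

    `visible_areas` holds the projected area of each face visible from a
    single camera position (hypothetical input, e.g. from a renderer).
    """
    total = sum(visible_areas)
    if total <= 0:
        return 0.0
    entropy = 0.0
    for area in visible_areas:
        if area > 0:
            p = area / total
            entropy -= p * math.log2(p)
    return entropy


def best_viewpoints(areas_per_view: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the names of the k candidate views with the highest entropy."""
    ranked = sorted(
        areas_per_view,
        key=lambda name: view_entropy(areas_per_view[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical projected areas for three candidate camera positions.
candidate_views = {
    "front": [4.0, 4.0, 4.0, 4.0],  # four faces equally visible
    "side": [14.0, 1.0, 1.0],       # one face dominates the image
    "top": [16.0],                  # only a single face is visible
}
selected = best_viewpoints(candidate_views, k=2)
print(selected)
```

In this toy setting the view that shows many faces in balanced proportion scores highest, which matches the intuition that informative views expose more of the object's surface. A learned estimator, as described in the abstract, would replace the explicit area computation with a prediction made directly from the 3D object.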
