End-to-End Learning of Multi-category 3D Pose and Shape Estimation

In this paper, we study the representation of the shape and pose of objects using their keypoints. To this end, we propose an end-to-end method that simultaneously detects 2D keypoints in an image and lifts them to 3D. The proposed method learns both 2D detection and 3D lifting from 2D keypoint annotations only. To make this possible, we propose, for the first time, a method that explicitly disentangles pose and 3D shape by means of augmentation-based cyclic self-supervision. In addition to being end-to-end from image to 3D, our method handles objects from multiple categories with a single neural network. We use a Transformer-based architecture to detect the keypoints and to summarize the visual context of the image. This visual context is then used while lifting the keypoints to 3D, enabling context-based reasoning for better performance. While lifting, our method learns a small set of basis shapes and their sparse non-negative coefficients to represent the 3D shape in a canonical frame. Our method can handle occlusions as well as a wide variety of object classes. Our experiments on three benchmarks demonstrate that our method performs better than the state-of-the-art. Our source code will be made publicly available.
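The shape representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a ReLU to obtain sparse non-negative coefficients, the orthographic camera, and the toy dimensions (4 basis shapes, 8 keypoints) are all assumptions made for illustration. It shows only how a canonical 3D shape can be written as a non-negative combination of basis shapes, then posed by a rotation and projected to 2D keypoints.

```python
import numpy as np


def relu(x):
    # Clamp negatives to zero: the result is non-negative, and every
    # clamped entry is exactly zero, which gives sparsity.
    return np.maximum(x, 0.0)


def compose_shape(basis, raw_coeffs):
    """Combine basis shapes into one canonical 3D shape.

    basis:      (K, N, 3) array, K basis shapes with N keypoints each.
    raw_coeffs: (K,) unconstrained coefficients (hypothetical network output).
    Returns the (N, 3) canonical shape and the (K,) non-negative coefficients.
    """
    coeffs = relu(raw_coeffs)
    shape = np.tensordot(coeffs, basis, axes=1)  # sum_k coeffs[k] * basis[k]
    return shape, coeffs


def rotation_z(theta):
    # Rotation about the z-axis, used here as a stand-in for the object pose.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def project_orthographic(shape_3d, rotation):
    # Pose the canonical shape, then drop depth (orthographic camera assumed).
    posed = shape_3d @ rotation.T
    return posed[:, :2]


rng = np.random.default_rng(0)
K, N = 4, 8                               # toy sizes, chosen arbitrarily
basis = rng.normal(size=(K, N, 3))        # random stand-in for learned basis shapes
raw = np.array([0.7, -0.3, 0.0, 1.2])     # stand-in for predicted coefficients

shape, coeffs = compose_shape(basis, raw)
keypoints_2d = project_orthographic(shape, rotation_z(np.pi / 6))
```

Because pose enters only through the rotation and shape only through the coefficients, the two factors are kept separate by construction in this toy model; how the paper enforces that separation during training is not captured here.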
