论文信息 - Bilevel Online Adaptation for Out-of-Domain Human Mesh Reconstruction

Bilevel Online Adaptation for Out-of-Domain Human Mesh Reconstruction

This paper considers a new problem of adapting a pretrained model of human mesh reconstruction to out-of-domain streaming videos. However, most previous methods based on the parametric SMPL model [36] underperform in new domains with unexpected, domain-specific attributes, such as camera parameters, lengths of bones, backgrounds, and occlusions. Our general idea is to dynamically fine-tune the source model on test video streams with additional temporal constraints, such that it can mitigate the domain gaps without over-fitting the 2D information of individual test frames. A subsequent challenge is how to avoid conflicts between the 2D and temporal constraints. We propose to tackle this problem using a new training algorithm named Bilevel Online Adaptation (BOA), which divides the optimization process of overall multi-objective into two steps of weight probe and weight update in a training iteration. We demonstrate that BOA leads to state-of-the-art results on two human mesh reconstruction benchmarks1.

[1] Dimitrios Tzionas,et al. Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Jie Song,et al. Human Body Model Fitting by Learned Gradient Descent , 2020, ECCV.

[3] Michael J. Black,et al. SMPL: A Skinned Multi-Person Linear Model , 2023 .

[4] Jitendra Malik,et al. Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Jiashi Feng,et al. Inference Stage Optimization for Cross-scenario 3D Human Pose Estimation , 2020, NeurIPS.

[6] Nicu Sebe,et al. Online Depth Learning Against Forgetting in Monocular Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Pascal Fua,et al. Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[8] Kaiming He,et al. Group Normalization , 2018, ECCV.

[9] Andrew Zisserman,et al. Sim2real transfer learning for 3D human pose estimation: motion to the rescue , 2019, NeurIPS.

[10] Kyoung Mu Lee,et al. Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose , 2020, ECCV.

[11] Andrew Zisserman,et al. Exploiting Temporal Context for 3D Human Pose Estimation in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Trevor Darrell,et al. Adapting to Continuously Shifting Domains , 2018, ICLR.

[13] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[14] Ambrish Tyagi,et al. PoseNet3D: Learning Temporally Consistent 3D Human Pose via Knowledge Distillation , 2020, 2020 International Conference on 3D Vision (3DV).

[15] Peter V. Gehler,et al. Unite the People: Closing the Loop Between 3D and 2D Human Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Yongxin Yang,et al. Episodic Training for Domain Generalization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17] Bernhard Schölkopf,et al. Domain Generalization via Invariant Feature Representation , 2013, ICML.

[18] Xiaowei Zhou,et al. Learning to Estimate 3D Human Pose and Shape from a Single Color Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19] Mengjie Zhang,et al. Domain Generalization for Object Recognition with Multi-task Autoencoders , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20] Michael J. Black,et al. Combined discriminative and generative articulated pose and non-rigid shape estimation , 2007, NIPS.

[21] Silvio Savarese,et al. Generalizing to Unseen Domains via Adversarial Data Augmentation , 2018, NeurIPS.

[22] Konrad Schindler,et al. Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations , 2020, AAAI.

[23] Luigi di Stefano,et al. Real-Time Self-Adaptive Deep Stereo , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Michael J. Black,et al. Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25] Siddhartha Chaudhuri,et al. Generalizing Across Domains via Cross-Gradient Training , 2018, ICLR.

[26] Kyoung Mu Lee,et al. I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image , 2020, ECCV.

[27] Alex ChiChung Kot,et al. Domain Generalization with Adversarial Feature Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28] Harri Valpola,et al. Weight-averaged consistency targets improve semi-supervised deep learning results , 2017, ArXiv.

[29] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[30] Yinghao Huang,et al. Towards Accurate Marker-Less Human Shape and Pose Estimation over Time , 2017, 2017 International Conference on 3D Vision (3DV).

[31] Michael J. Black,et al. Estimating human shape and pose from a single image , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[32] Alexei A. Efros,et al. Undoing the Damage of Dataset Bias , 2012, ECCV.

[33] Luigi di Stefano,et al. Learning to Adapt for Stereo , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Andre Wibisono,et al. Streaming Variational Bayes , 2013, NIPS.

[35] Peter V. Gehler,et al. Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[36] Pamela J. Hinds,et al. Challenges to grounding in human-robot interaction , 2006, HRI '06.

[37] Bodo Rosenhahn,et al. Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[38] Yongxin Yang,et al. Deeper, Broader and Artier Domain Generalization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39] Cristian Sminchisescu,et al. Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40] Sergey Levine,et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[41] Cristian Sminchisescu,et al. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42] Liu Wu,et al. Human Mesh Recovery From Monocular Images via a Skeleton-Disentangled Representation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[43] Alexander C. Berg,et al. Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers , 2018, ECCV.

[44] Yongxin Yang,et al. Learning to Generalize: Meta-Learning for Domain Generalization , 2017, AAAI.

[45] Charless C. Fowlkes,et al. Predicting Camera Viewpoint Improves Cross-dataset Generalization for 3D Human Pose Estimation , 2020, ECCV Workshops.

[46] Iasonas Kokkinos,et al. HoloPose: Holistic 3D Human Reconstruction In-The-Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Ersin Yumer,et al. Self-supervised Learning of Motion Capture , 2017, NIPS.

[48] Kostas Daniilidis,et al. Convolutional Mesh Regression for Single-Image Human Shape Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Peter V. Gehler,et al. Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[50] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[51] Michael J. Black,et al. VIBE: Video Inference for Human Body Pose and Shape Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Otmar Hilliges,et al. Structured Prediction Helps 3D Human Motion Modelling , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[53] Aryan Mokhtari,et al. On the Convergence Theory of Gradient-Based Model-Agnostic Meta-Learning Algorithms , 2019, AISTATS.

[54] Hongbin Zha,et al. Self-Supervised Deep Visual Odometry With Online Adaptation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[56] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Jitendra Malik,et al. End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58] G. A. Giraldi,et al. Introduction to Augmented Reality , 2003 .

[59] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[60] Song-Chun Zhu,et al. DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[61] Ziyan Wu,et al. Towards Robust RGB-D Human Mesh Recovery , 2019, ArXiv.

[62] Andrea Vedaldi,et al. Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation , 2020, 2021 International Conference on 3D Vision (3DV).

[63] Bastian Leibe,et al. Online Adaptation of Convolutional Neural Networks for Video Object Segmentation , 2017, BMVC.

[64] Yu Wang,et al. Learning to Adapt to Evolving Domains , 2020, NeurIPS.

[65] Yoram Singer,et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..