Paper Information - Hallucinating Pose-Compatible Scenes

Hallucinating Pose-Compatible Scenes

What does human pose tell us about a scene? We propose a task to answer this question: given human pose as input, hallucinate a compatible scene. Subtle cues captured by human pose — action semantics, environment affordances, object interactions — provide surprising insight into which scenes are compatible. We present a large-scale generative adversarial network for pose-conditioned scene generation. We significantly scale the size and complexity of training data, curating a massive meta-dataset containing over 19 million frames of humans in everyday environments. We double the capacity of our model with respect to StyleGAN2 to handle such complex data, and design a pose conditioning mechanism that drives our model to learn the nuanced relationship between pose and scene. We leverage our trained model for various applications: hallucinating pose-compatible scene(s) with or without humans, visualizing incompatible scenes and poses, placing a person from one generated image into another scene, and animating pose. Our model produces diverse samples and outperforms pose-conditioned StyleGAN2 and Pix2Pix baselines in terms of accurate human placement (percent of correct keypoints) and image quality (Fréchet inception distance).
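The human-placement metric named in the abstract, percent of correct keypoints (PCK), can be illustrated with a minimal sketch. It assumes the common definition: a predicted keypoint counts as correct when it lies within a fraction `alpha` of a reference length (for example a person bounding-box size) from the target keypoint. The function name `pck`, the parameters `ref_size` and `alpha`, and the default threshold are illustrative assumptions, not the authors' evaluation code.

```python
import math


def pck(pred, target, ref_size, alpha=0.5, visible=None):
    """Return the fraction of visible keypoints predicted within
    alpha * ref_size (Euclidean distance) of their target location.

    pred, target: sequences of (x, y) pairs in the same order.
    ref_size: reference length used to normalize the threshold (assumed).
    visible: optional sequence of booleans; hidden keypoints are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if visible is None:
        visible = [True] * len(target)

    threshold = alpha * ref_size
    hits = 0
    total = 0
    for (px, py), (tx, ty), vis in zip(pred, target, visible):
        if not vis:
            continue  # ignore keypoints that are not annotated or visible
        total += 1
        if math.hypot(px - tx, py - ty) <= threshold:
            hits += 1
    # With no visible keypoints the score is undefined.
    return hits / total if total else float("nan")


# Hypothetical keypoints, purely for illustration.
predicted = [(0, 0), (10, 10), (20, 20), (30, 30)]
expected = [(0, 1), (10, 10), (25, 20), (30, 50)]
score = pck(predicted, expected, ref_size=10, alpha=0.5)
```

In a generation setting like the one described, `target` would be the input pose and `pred` the keypoints detected in the generated image by a pose estimator; that pairing is likewise an assumption about how such an evaluation is typically arranged.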

Alexei A. Efros | Tim Brooks

[37] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008 .

[38] Tero Karras,et al. Training Generative Adversarial Networks with Limited Data , 2020, NeurIPS.

[39] Yinda Zhang,et al. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop , 2015, ArXiv.

[40] Yann Dauphin,et al. Hierarchical Neural Story Generation , 2018, ACL.

[41] Ilija Radosavovic,et al. Reconstructing Hand-Object Interactions in the Wild , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[42] Chen Huang,et al. Dense Intrinsic Appearance Flow for Human Pose Transfer , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43] Jingwei Xu,et al. Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Phillip Isola,et al. On the "steerability" of generative adversarial networks , 2019, ICLR.

[45] Li Fei-Fei,et al. Reasoning about Object Affordances in a Knowledge Base Representation , 2014, ECCV.

[46] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.

[47] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[48] Li Fei-Fei,et al. Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[49] Andrew Owens,et al. CNN-Generated Images Are Surprisingly Easy to Spot… for Now , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Alexei A. Efros,et al. People Watching: Human Actions as a Cue for Single View Geometry , 2012, International Journal of Computer Vision.

[51] Jitendra Malik,et al. From Lifestyle Vlogs to Everyday Interactions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52] Andrea Vedaldi,et al. Objects in Context , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[53] Antonio Torralba,et al. The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement , 2020, ECCV.

[54] Yejin Choi,et al. The Curious Case of Neural Text Degeneration , 2019, ICLR.

[55] Ning Xu,et al. YouTube-VOS: Sequence-to-Sequence Video Object Segmentation , 2018, ECCV.

[56] Alexei A. Efros,et al. An empirical study of context in object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[57] Simon Osindero,et al. Conditional Generative Adversarial Nets , 2014, ArXiv.

[58] Bolei Zhou,et al. GAN Dissection: Visualizing and Understanding Generative Adversarial Networks , 2018, ICLR.

[59] Frédo Durand,et al. Synthesizing Images of Humans in Unseen Poses , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60] Jeff Donahue,et al. Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[61] Bernt Schiele,et al. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[62] Fabio Viola,et al. The Kinetics Human Action Video Dataset , 2017, ArXiv.

[63] J. Gibson. The Ecological Approach to Visual Perception , 1979 .

[64] Luc Van Gool,et al. Pose Guided Person Image Generation , 2017, NIPS.

[65] Weiyu Zhang,et al. From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding , 2013, 2013 IEEE International Conference on Computer Vision.

[66] Luc Van Gool,et al. Disentangled Person Image Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[67] Jitendra Malik,et al. Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[68] Song Han,et al. Differentiable Augmentation for Data-Efficient GAN Training , 2020, NeurIPS.

[69] Boyuan Chen,et al. Oops! Predicting Unintentional Action in Video , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[70] Alexei A. Efros,et al. Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[71] Hema Swetha Koppula,et al. Learning human activities and object affordances from RGB-D videos , 2012, Int. J. Robotics Res..

[72] Jaakko Lehtinen,et al. GANSpace: Discovering Interpretable GAN Controls , 2020, NeurIPS.

[73] Yun Jiang,et al. Hallucinated Humans as the Hidden Context for Labeling 3D Scenes , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[74] Dani Lischinski,et al. Deep Video‐Based Performance Cloning , 2018, Comput. Graph. Forum.

[75] Prafulla Dhariwal,et al. Glow: Generative Flow with Invertible 1x1 Convolutions , 2018, NeurIPS.

[76] Sebastian Nowozin,et al. Which Training Methods for GANs do actually Converge? , 2018, ICML.

[77] Jessica K. Hodgins,et al. Interactive control of avatars animated with human motion data , 2002, SIGGRAPH.

[78] Alexei A. Efros,et al. From 3D scene geometry to human workspace , 2011, CVPR 2011.

[79] Alexei A. Efros,et al. Scene Semantics from Long-Term Observation of People , 2012, ECCV.

[80] Alexei A. Efros,et al. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.