HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes

Learning to generate diverse, scene-aware, and goal-oriented human motions in 3D scenes remains challenging because existing Human-Scene Interaction (HSI) datasets are limited in scale and quality and lack semantic annotations. To fill this gap, we propose a large-scale, semantically rich synthetic HSI dataset, denoted HUMANISE, built by aligning captured human motion sequences with various 3D indoor scenes. We automatically annotate the aligned motions with language descriptions that specify the action and the unique interacting objects in the scene, e.g., "sit on the armchair near the desk." HUMANISE thus enables a new generation task: language-conditioned human motion generation in 3D scenes. The task is challenging because it requires jointly modeling the 3D scene, human motion, and natural language. To tackle it, we present a novel scene-and-language conditioned generative model that produces 3D human motions performing the desired action while interacting with the specified objects. Our experiments demonstrate that our model generates diverse and semantically consistent human motions in 3D scenes.
