GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping

Generating digital humans that move realistically has many applications and is widely studied, but existing methods focus on the major limbs of the body, ignoring the hands and head. Hands have been studied separately, but the focus has been on generating realistic static grasps of objects. To synthesize virtual characters that interact with the world, we need to generate full-body motions and realistic hand grasps simultaneously. Both sub-problems are challenging on their own and, together, the state space of poses is significantly larger, the scales of hand and body motions differ, and the whole-body posture and the hand grasp must agree, satisfy physical constraints, and be plausible. Additionally, the head is involved because the avatar must look at the object to interact with it. For the first time, we address the problem of generating full-body, hand, and head motions of an avatar grasping an unknown object. As input, our method, called GOAL, takes a 3D object, its position, and a starting 3D body pose and shape. GOAL outputs a sequence of whole-body poses using two novel networks. First, GNet generates a goal whole-body grasp with a realistic body, head, arm, and hand pose, as well as hand-object contact. Second, MNet generates the motion between the starting and goal pose. This is challenging, as it requires the avatar to walk towards the object with foot-ground contact, orient the head towards it, reach out, and grasp it with a realistic hand pose and hand-object contact. To achieve this, the networks exploit a representation that combines SMPL-X body parameters and 3D vertex offsets. We train and evaluate GOAL, both qualitatively and quantitatively, on the GRAB dataset. Results show that GOAL generalizes well to unseen objects, outperforming baselines. A perceptual study shows that GOAL’s generated motions approach the realism of GRAB’s ground truth. GOAL takes a step towards synthesizing realistic full-body object grasping.
Our models and code will be available at https://goal.is.tuebingen.mpg.de.
arXiv:2112.11454v1 [cs.CV] 21 Dec 2021
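
The two-stage structure described above (a goal grasp first, then the motion that reaches it) can be sketched as a small pipeline. This is only an illustration of the data flow, not the released GOAL code: `gnet_stub`, `mnet_stub`, and `generate_grasp_motion` are hypothetical names, the flat pose vector is a stand-in for SMPL-X parameters, the fixed standoff is an arbitrary assumption, and plain linear interpolation is swapped in where the learned MNet would be.

```python
import numpy as np


def gnet_stub(object_position, start_pose):
    """Hypothetical stand-in for GNet: return a goal pose and root translation.

    A learned model would predict a whole-body grasp; here the goal is a
    zero pose placed at an arbitrary 0.5-unit standoff from the object.
    """
    goal_pose = np.zeros_like(start_pose)
    standoff = np.array([0.0, 0.0, 0.5])  # assumption, not from the paper
    goal_transl = np.asarray(object_position, dtype=float) - standoff
    return goal_pose, goal_transl


def mnet_stub(start, goal, n_frames):
    """Hypothetical stand-in for MNet: linearly blend start into goal.

    Linear blending of pose parameters is NOT the learned motion network
    and is generally not a valid way to interpolate joint rotations.
    """
    w = np.linspace(0.0, 1.0, n_frames)[:, None]  # (n_frames, 1) blend weights
    return (1.0 - w) * start + w * goal


def generate_grasp_motion(object_position, start_pose, start_transl, n_frames=10):
    """Chain the two stages: goal generation, then motion in-filling."""
    goal_pose, goal_transl = gnet_stub(object_position, start_pose)
    start = np.concatenate([np.asarray(start_transl, dtype=float), start_pose])
    goal = np.concatenate([goal_transl, goal_pose])
    return mnet_stub(start, goal, n_frames)  # shape (n_frames, 3 + pose_dim)


motion = generate_grasp_motion([1.0, 0.0, 0.0], np.ones(6), np.zeros(3), n_frames=5)
```

Each row of `motion` is one frame holding the root translation followed by the pose vector; the first row is the starting state and the last row is the goal produced by the first stage.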
