GRIP: Generating Interaction Poses Using Latent Consistency and Spatial Cues

Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality. Prior work on capturing and modeling humans interacting with objects in 3D focuses on the body and object motion, often ignoring hand pose. In contrast, we introduce GRIP, a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction. As a preliminary step before synthesizing the hand motion, we first use a network, ANet, to denoise the arm motion. Then, we leverage the spatio-temporal relationship between the body and the object to extract two types of novel temporal interaction cues, and use them in a two-stage inference pipeline to generate the hand motion. In the first stage, we introduce a new approach that enforces temporal consistency in the latent space (latent temporal consistency, LTC) and generates consistent interaction motions. In the second stage, GRIP generates refined hand poses to avoid hand-object penetrations. Given sequences of noisy body and object motion, GRIP upgrades them to include hand-object interaction. Quantitative experiments and perceptual studies demonstrate that GRIP outperforms baseline methods and generalizes to unseen objects and motions from different motion-capture datasets.

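To make the described pipeline concrete, below is a minimal PyTorch sketch of a GRIP-style two-stage inference flow: an arm-denoising network standing in for ANet, a coarse hand-motion stage that enforces temporal consistency over per-frame latents, and a refinement stage conditioned on hand-object distance features to discourage penetration. All module names, feature dimensions, and layer choices are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch of a GRIP-style two-stage inference pipeline.
# All module names, feature dimensions, and architectural choices are
# assumptions for illustration; they are not the paper's implementation.
import torch
import torch.nn as nn


class ArmDenoiser(nn.Module):
    """Stand-in for ANet: denoises noisy arm joint rotations per frame."""
    def __init__(self, dim_arm=2 * 3 * 6):  # 2 arms x 3 joints x 6D rotation (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_arm, 256), nn.ReLU(),
            nn.Linear(256, dim_arm),
        )

    def forward(self, arm_pose):              # (T, dim_arm)
        return arm_pose + self.net(arm_pose)  # predict a residual correction


class CoarseHandNet(nn.Module):
    """Stage 1: maps body/object features plus interaction cues to per-frame
    latents, then couples the latents across time (LTC-style) before decoding."""
    def __init__(self, dim_in=512, dim_z=64, dim_hand=2 * 15 * 6):
        super().__init__()
        self.encode = nn.Linear(dim_in, dim_z)
        self.ltc = nn.GRU(dim_z, dim_z, batch_first=True)  # temporal consistency in latent space
        self.decode = nn.Linear(dim_z, dim_hand)

    def forward(self, feats):                 # (1, T, dim_in)
        z = self.encode(feats)                # per-frame latents
        z_consistent, _ = self.ltc(z)         # temporally consistent latents
        return self.decode(z_consistent)      # coarse hand poses (1, T, dim_hand)


class RefineNet(nn.Module):
    """Stage 2: refines coarse hand poses using hand-object distance features
    to discourage interpenetration (here, a simple residual regressor)."""
    def __init__(self, dim_hand=2 * 15 * 6, dim_dist=99):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_hand + dim_dist, 256), nn.ReLU(),
            nn.Linear(256, dim_hand),
        )

    def forward(self, coarse_hands, hand_obj_dists):
        x = torch.cat([coarse_hands, hand_obj_dists], dim=-1)
        return coarse_hands + self.net(x)


# Toy forward pass with random stand-in inputs (T = 30 frames).
T = 30
arm = ArmDenoiser()(torch.randn(T, 36))               # denoised arm motion
feats = torch.randn(1, T, 512)                        # body/object + interaction-cue features (assumed)
coarse = CoarseHandNet()(feats)                       # stage 1: temporally consistent coarse hands
refined = RefineNet()(coarse, torch.randn(1, T, 99))  # stage 2: penetration-aware refinement
print(arm.shape, coarse.shape, refined.shape)
```

The GRU over per-frame latents is one simple way to impose consistency in latent space; GRIP's actual latent-consistency mechanism and interaction-cue encoding may differ.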