DECO: Dense Estimation of 3D Human-Scene Contact In The Wild

Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available at https://deco.is.tue.mpg.de.

[1]  Michael J. Black,et al.  BITE: Beyond Priors for Improved Three-D Dog Pose Estimation , 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Michael J. Black,et al.  3D Human Pose Estimation via Intuitive Physics , 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Bharat Lal Bhatnagar,et al.  Visibility Aware Human-Object Interaction Tracking from Single RGB Camera , 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Michael J. Black,et al.  Detecting Human-Object Contact in Images , 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Michael J. Black,et al.  MIME: Human-Aware 3D Scene Generation , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Michael J. Black,et al.  InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction , 2022, GCPR.

[7]  Jianzhuang Liu,et al.  CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation , 2022, ECCV.

[8]  Michael J. Black,et al.  Capturing and Inferring Dense Full-Body Human-Scene Contact , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Michael J. Black,et al.  Accurate 3D Body Shape Regression using Metric and Semantic Attributes , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  M. Andriluka,et al.  Differentiable Dynamics for Articulated 3d Human Motion Reconstruction , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  C. Theobalt,et al.  HULC: 3D Human Motion Capture with Pose Manifold Sampling and Dense Contact Guidance , 2022, ECCV.

[12]  Jing Zhang,et al.  ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation , 2022, NeurIPS.

[13]  Bharat Lal Bhatnagar,et al.  BEHAVE: Dataset and Method for Tracking Human Object Interactions , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Bharat Lal Bhatnagar,et al.  CHORE: Contact, Human and Object REconstruction from a single RGB image , 2022, ECCV.

[15]  Kris Kitani,et al.  Occluded Human Mesh Recovery , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Christopher D. Twigg,et al.  PressureVision: Estimating Hand Pressure from a Single RGB Image , 2022, ECCV.

[17]  Michael J. Black,et al.  GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  A. Schwing,et al.  Masked-attention Mask Transformer for Universal Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Marc Pollefeys,et al.  Learning Motion Priors for 4D Human Body Capture in 3D Scenes , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Dimitrios Tzionas,et al.  Collaborative Regression of Expressive Bodies using Moderation , 2021, 2021 International Conference on 3D Vision (3DV).

[21]  Leonidas J. Guibas,et al.  HuMoR: 3D Human Motion Model for Robust Pose Estimation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Eun-Sol Kim,et al.  HOTR: End-to-End Human-Object Interaction Detection with Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Michael J. Black,et al.  PARE: Part Attention Regressor for 3D Human Body Estimation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Charles C. Kemp,et al.  ContactOpt: Optimizing Contact to Improve Grasps , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Michael J. Black,et al.  On Self-Contact and Human Pose , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Kris Kitani,et al.  SimPoE: Simulated Character Control for 3D Human Pose Estimation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Torsten Sattler,et al.  Human POSEitioning System (HPS): 3D Human Pose Estimation and Self-localization in Large Scenes from Body-Mounted Sensors , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Zhenan Sun,et al.  PyMAF: 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Jian Sun,et al.  End-to-End Human Object Interaction Detection with HOI Transformer , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Joachim Tesch,et al.  Populating 3D Scenes by Learning Human-Scene Interaction , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Cristian Sminchisescu,et al.  Learning Complex 3D Human Self-Contact , 2020, AAAI.

[32]  Kevin Lin,et al.  End-to-End Human Pose and Mesh Reconstruction with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Ilija Radosavovic,et al.  Reconstructing Hand-Object Interactions in the Wild , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  Serena Yeung,et al.  Holistic 3D Human and Scene Mesh Estimation from Single View Images , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Minh Hoai,et al.  Detecting Hands and Recognizing Physical Contact in the Wild , 2020, NeurIPS.

[36]  Dimitrios Tzionas,et al.  GRAB: A Dataset of Whole-Body Human Grasping of Objects , 2020, ECCV.

[37]  Yan Zhang,et al.  PLACE: Proximity Learning of Articulation and Contact in 3D Environments , 2020, 2020 International Conference on 3D Vision (3DV).

[38]  Deva Ramanan,et al.  Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild , 2020, ECCV.

[39]  Leonidas J. Guibas,et al.  Contact and Human Dynamics from Monocular Video , 2020, SCA.

[40]  Charles C. Kemp,et al.  ContactPose: A Dataset of Grasps with Object Contact and Hand Pose , 2020, ECCV.

[41]  Minh Vo,et al.  Long-term Human Motion Prediction with Scene Context , 2020, ECCV.

[42]  Andreas Aristidou,et al.  MotioNet , 2020, ACM Trans. Graph..

[43]  David F. Fouhey,et al.  Understanding Human Hands in Contact at Internet Scale , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Cristian Sminchisescu,et al.  Three-Dimensional Reconstruction of Human Interactions , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Cewu Lu,et al.  PaStaNet: Toward Human Activity Knowledge Engine , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Greg Turk,et al.  Bodies at Rest: 3D Human Pose and Shape Estimation From a Pressure Image Using Synthetic Data , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Jimei Yang,et al.  Reducing Footskate in Human Motion Reconstruction with Ground Contact Constraints , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[48]  Michael J. Black,et al.  Generating 3D People in Scenes Without People , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Wan-Yen Lo,et al.  Accelerating 3D deep learning with PyTorch3D , 2019, SIGGRAPH Asia 2020 Courses.

[50]  Jorma Laaksonen,et al.  Deep Contextual Attention for Human-Object Interaction Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[51]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[52]  Yang Zhao,et al.  Deep High-Resolution Representation Learning for Visual Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[53]  Dimitrios Tzionas,et al.  Resolving 3D Human Pose Ambiguities With 3D Scene Constraints , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[54]  Liu Wu,et al.  Human Mesh Recovery From Monocular Images via a Skeleton-Disentangled Representation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[55]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[56]  Mohan S. Kankanhalli,et al.  Learning to Detect Human-Object Interactions With Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Michael J. Black,et al.  Learning Joint Reconstruction of Hands and Manipulated Objects , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Meng Wang,et al.  Graphonomy: Universal Human Parsing via Graph Transfer Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Nikolaus F. Troje,et al.  AMASS: Archive of Motion Capture As Surface Shapes , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[60]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Yanxi Liu,et al.  Learning Dynamics from Kinematics: Estimating 2D Foot Pressure Maps from Video Frames , 2018, 1811.12607.

[62]  Song-Chun Zhu,et al.  Learning Human-Object Interactions by Graph Parsing Neural Networks , 2018, ECCV.

[63]  Cristian Sminchisescu,et al.  Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[64]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[65]  Bernt Schiele,et al.  PoseTrack: A Benchmark for Human Pose Estimation and Tracking , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66]  A. Gupta,et al.  Binge Watching: Scaling Affordance Learning from Sitcoms , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[68]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[69]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[70]  Pascal Fua,et al.  Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[71]  Sinisa Todorovic,et al.  A Multi-scale CNN for Affordance Segmentation in RGB Images , 2016, ECCV.

[72]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[73]  Chenfanfu Jiang,et al.  Inferring Forces and Learning Human Utilities from Videos , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[75]  Jitendra Malik,et al.  Visual Semantic Role Labeling , 2015, ArXiv.

[76]  K. Gwet Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters , 2014 .

[77]  Hema Swetha Koppula,et al.  Physically Grounded Spatio-temporal Object Affordances , 2014, ECCV.

[78]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[79]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[80]  David A. Forsyth,et al.  Knowing when to put your foot down , 2006, I3D '06.

[81]  Masanobu Yamamoto,et al.  Scene constraints-aided tracking of human body , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[82]  Peter Secretan Learning , 1965, Mental Health.

[83]  D. Therriault,et al.  of three-dimensional , 2023 .

[84]  Eduard Gabriel Bazavan,et al.  REMIPS: Physically Consistent 3D Reconstruction of Multiple Interacting People under Weak Supervision , 2021, NeurIPS.

[85]  Stephen Lin,et al.  Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[86]  Yanxi Liu,et al.  From Image to Stability: Learning Dynamics from Human Pose , 2020, ECCV.

[87]  Michael J. Black,et al.  Embodied Hands : Modeling and Capturing Hands and Bodies Together * * Supplementary Material * * , 2017 .