Learning a Generative Model for Multi-Step Human-Object Interactions from Videos

Creating dynamic virtual environments consisting of humans interacting with objects is a fundamental problem in computer graphics. While it is well accepted that agent interactions play an essential role in synthesizing such scenes, most existing techniques focus exclusively on static scenes and leave the dynamic component out. In this paper, we present a generative model to synthesize plausible multi-step dynamic human-object interactions. Generating multi-step interactions is challenging because the space of such interactions grows exponentially with the number of objects, activities, and time steps. We propose to handle this combinatorial complexity by learning a lower-dimensional space of plausible human-object interactions. We use action plots to represent interactions as a sequence of discrete actions along with the participating objects and their states. To build action plots, we present an automatic method that applies state-of-the-art computer vision techniques to RGB videos to detect individual objects and their states, extract the involved hands, and recognize the actions performed. The action plots are built by observing videos of everyday activities and are used to train a generative model based on a Recurrent Neural Network (RNN). The network learns the causal dependencies and constraints between individual actions and can be used to generate novel and diverse multi-step human-object interactions. Our representation and generative model enable new capabilities in a variety of applications, such as interaction prediction, animation synthesis, and motion planning for a real robotic agent.
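
To make the representation concrete, the following is a minimal sketch, in PyTorch, of how an action plot and an RNN-based generative model over action plots could be encoded. It is not the paper's implementation: the class names (ActionStep, ActionPlotRNN), the tokenization of each (action, object, state) step into a single id, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of an action-plot representation and an autoregressive RNN
# over action-plot tokens. All names, vocabulary sizes, and hyperparameters
# are illustrative assumptions, not the paper's exact setup.
from dataclasses import dataclass
from typing import List

import torch
import torch.nn as nn


@dataclass
class ActionStep:
    """One step of an action plot: a discrete action applied to an object,
    together with the object's resulting state (e.g. open a closed cupboard)."""
    action: str  # e.g. "open", "pour", "put"
    obj: str     # participating object, e.g. "cup"
    state: str   # object state after the action, e.g. "full"


class ActionPlotRNN(nn.Module):
    """Each tokenized step is embedded and fed through a GRU that predicts
    the next step, so the network can pick up causal dependencies and
    constraints between consecutive actions."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer ids of (action, object, state) steps
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # next-step logits at every position

    @torch.no_grad()
    def sample(self, start: int, max_len: int = 20) -> List[int]:
        """Generate a novel interaction sequence one step at a time."""
        seq, hidden = [start], None
        for _ in range(max_len):
            inp = torch.tensor([[seq[-1]]])
            h, hidden = self.rnn(self.embed(inp), hidden)
            probs = torch.softmax(self.head(h[:, -1]), dim=-1)
            seq.append(torch.multinomial(probs, 1).item())
        return seq
```

In this sketch the model is trained with a standard cross-entropy loss between forward() logits and the input sequence shifted by one step, and sample() rolls the GRU forward to produce new, diverse plots; how steps are tokenized and decoded back into (action, object, state) triples is left abstract here.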
