PHASE: PHysically-grounded Abstract Social Events for Machine Social Perception

The ability to perceive and reason about social interactions in the context of physical environments is core to human social intelligence and human-machine cooperation. However, no prior dataset or benchmark has systematically evaluated physically grounded perception of complex social interactions that go beyond short actions, such as high-fiving, or simple group activities, such as gathering. In this work, we create a dataset of physically-grounded abstract social events, PHASE, that resemble a wide range of real-life social interactions by including social concepts such as helping another agent. PHASE consists of 2D animations of pairs of agents moving in a continuous space generated procedurally using a physics engine and a hierarchical planner. Agents have a limited field of view, and can interact with multiple objects, in an environment that has multiple landmarks and obstacles. Using PHASE, we design a social recognition task and a social prediction task. PHASE is validated with human experiments demonstrating that humans perceive rich interactions in the social events, and that the simulated agents behave similarly to humans. As a baseline model, we introduce a Bayesian inverse planning approach, SIMPLE (SIMulation, Planning and Local Estimation), which outperforms state-of-the-art feedforward neural networks. We hope that PHASE can serve as a difficult new challenge for developing new models that can recognize complex social interactions.

[1]  Song-Chun Zhu,et al.  Joint inference of groups, events and human roles in aerial videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Daniel Yamins,et al.  Active World Model Learning in Agent-rich Environments with Progress Curiosity , 2020, ICML 2020.

[3]  Silvio Savarese,et al.  Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Silvio Savarese,et al.  Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[6]  Michael S. Ryoo,et al.  Learning Social Affordance for Human-Robot Interaction , 2016, IJCAI.

[7]  Li Wang,et al.  Learning Actor Relation Graphs for Group Activity Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Eric Bonabeau,et al.  Agent-based modeling: Methods and techniques for simulating human systems , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[9]  Bolei Zhou,et al.  Moments in Time Dataset: One Million Videos for Event Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Jessica B. Hamrick,et al.  Simulation as an engine of physical scene understanding , 2013, Proceedings of the National Academy of Sciences.

[11]  Chris L. Baker,et al.  Rational quantitative attribution of beliefs, desires and percepts in human mentalizing , 2017, Nature Human Behaviour.

[12]  Jake K. Aggarwal,et al.  Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[13]  Thomas A. Funkhouser,et al.  Semantic Scene Completion from a Single Depth Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Chuang Gan,et al.  CLEVRER: CoLlision Events for Video REpresentation and Reasoning , 2020, ICLR.

[15]  J. Tenenbaum,et al.  Adventures in Flatland: Perceiving Social Interactions Under Physical Dynamics , 2020, CogSci.

[16]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17]  Nanning Zheng,et al.  Learning to infer human attention in daily activities , 2020, Pattern Recognit..

[18]  Joel Veness,et al.  Monte-Carlo Planning in Large POMDPs , 2010, NIPS.

[19]  Demis Hassabis,et al.  A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play , 2018, Science.

[20]  Remco C. Veltkamp,et al.  Spatio-Temporal Detection of Fine-Grained Dyadic Human Interactions , 2016, HBU.

[21]  Ian D. Reid,et al.  Structured Learning of Human Interactions in TV Shows , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Zhaoxin Li,et al.  STGAT: Modeling Spatial-Temporal Interactions for Human Trajectory Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  H. Francis Song,et al.  Machine Theory of Mind , 2018, ICML.

[24]  J. Tenenbaum,et al.  Learning a commonsense moral theory , 2017, Cognition.

[25]  P. J. Gmytrasiewicz,et al.  A Framework for Sequential Planning in Multi-Agent Settings , 2005, AI&M.

[26]  Song-Chun Zhu,et al.  Partitioning the Perception of Physical and Social Events Within a Unified Psychological Space , 2019, CogSci.

[27]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Song-Chun Zhu,et al.  Learning and Inferring “Dark Matter” and Predicting Human Intents and Trajectories in Videos , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Silvio Savarese,et al.  Understanding Collective Activitiesof People from Videos , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[31]  F. Heider,et al.  An experimental study of apparent behavior , 1944 .

[32]  Brian J Scholl,et al.  Chasing vs. stalking: interrupting the perception of animacy. , 2011, Journal of experimental psychology. Human perception and performance.

[33]  Steven L. Lytinen,et al.  Agent-based Simulation Platforms: Review and Development Recommendations , 2006, Simul..

[34]  C. Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Subramanian Ramanathan,et al.  SALSA: A Novel Dataset for Multimodal Group Behavior Analysis , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Andrew S. Gordon,et al.  An Authoring Tool for Movies in the Style of Heider and Simmel , 2014, ICIDS.

[38]  C. Lawrence Zitnick,et al.  Adopting Abstract Images for Semantic Scene Understanding , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Dimitris Samaras,et al.  Two-person interaction detection using body-pose features and multiple instance learning , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[41]  Minh Vo,et al.  Long-term Human Motion Prediction with Scene Context , 2020, ECCV.

[42]  Hang Zhao,et al.  HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization , 2017, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[43]  Igor Mordatch,et al.  Emergent Tool Use From Multi-Agent Autocurricula , 2019, ICLR.

[44]  Richard Bowden,et al.  Hollywood 3D: Recognizing Actions in 3D Natural Scenes , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Motion Capture , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[46]  Leslie Pack Kaelbling,et al.  Hierarchical task and motion planning in the now , 2011, 2011 IEEE International Conference on Robotics and Automation.

[47]  Chenfanfu Jiang,et al.  Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars , 2017, International Journal of Computer Vision.

[48]  Joshua B. Tenenbaum,et al.  Help or Hinder: Bayesian Models of Social Goal Inference , 2009, NIPS.

[49]  Greg Mori,et al.  A Hierarchical Deep Temporal Model for Group Activity Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Drew H. Abney,et al.  Journal of Experimental Psychology : Human Perception and Performance Influence of Musical Groove on Postural Sway , 2015 .