Towards Social Artificial Intelligence: Nonverbal Social Signal Prediction in a Triadic Interaction

We present a new research task and a dataset to understand human social interactions via computational methods, to ultimately endow machines with the ability to encode and decode a broad channel of social signals humans use. This research direction is essential to make a machine that genuinely communicates with humans, which we call Social Artificial Intelligence. We first formulate the ``social signal prediction'' problem as a way to model the dynamics of social signals exchanged among interacting individuals in a data-driven way. We then present a new 3D motion capture dataset to explore this problem, where the broad spectrum of social signals (3D body, face, and hand motions) are captured in a triadic social interaction scenario. Baseline approaches to predict speaking status, social formation, and body gestures of interacting individuals are presented in the defined social prediction framework.

[1]  Martial Hebert,et al.  An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders , 2016, ECCV.

[2]  Hans-Peter Seidel,et al.  A data-driven approach for real-time full body pose reconstruction from a depth camera , 2011, 2011 International Conference on Computer Vision.

[3]  H. Meeren,et al.  Rapid perceptual integration of facial expression and emotional body language. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Yiying Tong,et al.  FaceWarehouse: A 3D Facial Expression Database for Visual Computing , 2014, IEEE Transactions on Visualization and Computer Graphics.

[5]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Peter Robinson,et al.  OpenFace: An open source facial behavior analysis toolkit , 2016, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[7]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[9]  Luc Van Gool,et al.  Recognizing emotions expressed by body pose: A biologically inspired neural model , 2008, Neural Networks.

[10]  Rosalind W. Picard Affective computing: challenges , 2003, Int. J. Hum. Comput. Stud..

[11]  Maja Pantic,et al.  Social signal processing: Survey of an emerging domain , 2009, Image Vis. Comput..

[12]  Yaser Sheikh,et al.  3D Social Saliency from Head-mounted Cameras , 2012, NIPS.

[13]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Motion Capture , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  T. Kanade,et al.  Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[15]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[16]  Erik Cambria,et al.  Recent Trends in Deep Learning Based Natural Language Processing , 2017, IEEE Comput. Intell. Mag..

[17]  Silvio Savarese,et al.  Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Alessio Del Bue,et al.  Social interaction discovery by statistical analysis of F-formations , 2011, BMVC.

[19]  Jean Carletta,et al.  The AMI meeting corpus , 2005 .

[20]  Geoffrey E. Hinton,et al.  Conditional Restricted Boltzmann Machines for Structured Output Prediction , 2011, UAI.

[21]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[22]  Ralph Gross,et al.  The CMU Motion of Body (MoBo) Database , 2001 .

[23]  Fernando De la Torre,et al.  Selective Transfer Machine for Personalized Facial Action Unit Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Vittorio Murino,et al.  Social interactions by visual focus of attention in a three‐dimensional environment , 2013, Expert Syst. J. Knowl. Eng..

[25]  Rosalind W. Picard Affective Computing , 1997 .

[26]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[27]  A. Mehrabian Silent Messages: Implicit Communication of Emotions and Attitudes , 1971 .

[28]  Francesc Moreno-Noguer,et al.  3D Human Pose Estimation from a Single Image via Distance Matrix Regression , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  C. Darwin The Expression of the Emotions in Man and Animals , .

[30]  James M. Rehg,et al.  Decoding Children's Social Behavior , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Sanjiv Kumar,et al.  On the Convergence of Adam and Beyond , 2018 .

[32]  Jonathan Tompson,et al.  Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Shaogang Gong,et al.  Facial expression recognition based on Local Binary Patterns: A comprehensive study , 2009, Image Vis. Comput..

[34]  Kris M. Kitani,et al.  Action-Reaction: Forecasting the Dynamics of Human Interaction , 2014, ECCV.

[35]  Mark Steedman,et al.  Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents , 1994, SIGGRAPH.

[36]  A. Mehrabian,et al.  Inference of attitudes from nonverbal communication in two channels. , 1967, Journal of consulting psychology.

[37]  Yi Yang,et al.  Recognizing proxemics in personal photos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  P. Ekman,et al.  The Repertoire of Nonverbal Behavior: Categories, Origins, Usage, and Coding , 1969 .

[39]  James M. Rehg,et al.  Social interactions: A first-person perspective , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[41]  Erik Cambria,et al.  A review of affective computing: From unimodal analysis to multimodal fusion , 2017, Inf. Fusion.

[42]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[44]  Subramanian Ramanathan,et al.  SALSA: A Novel Dataset for Multimodal Group Behavior Analysis , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  E. Hall,et al.  The Hidden Dimension , 1970 .

[46]  P. Ekman,et al.  Facial action coding system , 2019 .

[47]  Hans-Peter Seidel,et al.  Markerless Motion Capture of Multiple Characters Using Multiview Image Segmentation , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Silvio Savarese,et al.  Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Yichen Wei,et al.  Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Taku Komura,et al.  A Deep Learning Framework for Character Motion Synthesis and Editing , 2016, ACM Trans. Graph..

[51]  B. de Gelder Why bodies? Twelve reasons for including bodily expressions in affective neuroscience. , 2009, Philosophical transactions of the Royal Society of London. Series B, Biological sciences.

[52]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[53]  M. Hickson,et al.  NVC, nonverbal communication: Studies and applications , 1985 .

[54]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[56]  Hans-Peter Seidel,et al.  Motion capture using joint skeleton tracking and surface estimation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Helbing,et al.  Social force model for pedestrian dynamics. , 1995, Physical review. E, Statistical physics, plasmas, fluids, and related interdisciplinary topics.

[58]  James S. Simkin Kinesics and Context: Essays on Body Motion Communication , 1972 .

[59]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[60]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[61]  Elisa Ricci,et al.  Space speaks: towards socially and personality aware visual surveillance , 2010, MPVA '10.

[62]  Jitendra Malik,et al.  Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[63]  Francesco Setti,et al.  F-Formation Detection: Individuating Free-Standing Conversational Groups in Images , 2015, PloS one.

[64]  Ruben Villegas,et al.  Learning to Generate Long-term Future via Hierarchical Prediction , 2017, ICML.

[65]  Yaser Sheikh,et al.  Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66]  Jean Carletta,et al.  The AMI Meeting Corpus: A Pre-announcement , 2005, MLMI.

[67]  Scott E. Hudson,et al.  Towards Robot Autonomy in Group Conversations: Understanding the Effects of Body Orientation and Gaze , 2017, 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI.

[68]  Yaser Sheikh,et al.  Hand Keypoint Detection in Single Images Using Multiview Bootstrapping , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Y. Trope,et al.  Body Cues, Not Facial Expressions, Discriminate Between Intense Positive and Negative Emotions , 2012, Science.

[70]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Interaction Capture , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.