Paper information - Towards Social Artificial Intelligence: Nonverbal Social Signal Prediction in a Triadic Interaction

We present a new research task and a dataset for understanding human social interactions via computational methods, with the ultimate goal of endowing machines with the ability to encode and decode the broad channel of social signals that humans use. This research direction is essential for building machines that genuinely communicate with humans, a capability we call Social Artificial Intelligence. We first formulate the "social signal prediction" problem as a way to model, in a data-driven manner, the dynamics of social signals exchanged among interacting individuals. We then present a new 3D motion capture dataset for exploring this problem, in which a broad spectrum of social signals (3D body, face, and hand motions) is captured in a triadic social interaction scenario. Finally, we present baseline approaches for predicting the speaking status, social formation, and body gestures of interacting individuals within this social signal prediction framework.
