A Multi-Stream Recurrent Neural Network for Social Role Detection in Multiparty Interactions

Understanding the dynamics of multiparty human interaction is a challenging problem that involves multiple data modalities and complex, temporally ordered exchanges among several people. We propose a unified framework that integrates synchronized video, audio, and text streams from four participants to capture the interaction dynamics of natural group meetings. We focus on estimating each participant's dynamic social role, i.e., Protagonist, Neutral, Supporter, or Gatekeeper. Our key innovation is to incorporate both co-occurrence features and successive-occurrence features within short time windows, so as to better describe the behavior of a target participant and the responses it elicits from others, using a multi-stream recurrent neural network. We evaluate our algorithm on the widely used AMI corpus and achieve a state-of-the-art accuracy of 78% for automatic dynamic social role detection. We further investigate the importance of different video and audio features for estimating social roles.
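
To make the architecture concrete, here is a minimal sketch of a multi-stream recurrent model of the kind the abstract describes: one recurrent stream per modality, whose final hidden states are fused and classified into the four social roles per time window. The feature dimensions, hidden size, and fusion by concatenation are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch, assuming PyTorch and illustrative per-modality feature
# dimensions (not the paper's actual configuration).
import torch
import torch.nn as nn

class MultiStreamRoleNet(nn.Module):
    """One LSTM per modality; final hidden states are fused and classified."""

    def __init__(self, dims=None, hidden=128, num_roles=4):
        super().__init__()
        # Hypothetical input sizes: video (e.g., CNN features), audio
        # (e.g., prosodic/spectral features), text (e.g., word embeddings).
        dims = dims or {"video": 512, "audio": 68, "text": 300}
        self.streams = nn.ModuleDict(
            {m: nn.LSTM(d, hidden, batch_first=True) for m, d in dims.items()}
        )
        self.classifier = nn.Linear(hidden * len(dims), num_roles)

    def forward(self, inputs):
        # inputs[m]: (batch, time, dims[m]) -- synchronized feature sequences
        # extracted over the same short time window for each modality.
        finals = []
        for m, lstm in self.streams.items():
            _, (h_n, _) = lstm(inputs[m])   # h_n: (1, batch, hidden)
            finals.append(h_n[-1])          # final hidden state per stream
        fused = torch.cat(finals, dim=-1)   # fusion by concatenation
        return self.classifier(fused)       # logits over the four roles

# Usage: one window of 30 synchronized time steps for a batch of 8 samples.
model = MultiStreamRoleNet()
batch = {m: torch.randn(8, 30, d)
         for m, d in {"video": 512, "audio": 68, "text": 300}.items()}
logits = model(batch)                       # shape: (8, 4)
```

The co-occurrence and successive-occurrence features the abstract mentions would be computed per window and fed in as part of the per-modality input sequences; this sketch only shows the multi-stream fusion skeleton.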
