Speaker Personality Recognition With Multimodal Explicit Many2many Interactions

Recently, speaker personality analysis has become an increasingly popular research task in human-computer interaction. Previous studies on user personality trait recognition typically focus on leveraging static information, e.g., tweets, images, and social relationships on social platforms and websites. In this paper, we instead utilize three kinds of dynamic speaking information, i.e., textual, visual, and acoustic temporal sequences, so that a computer can interpret human personality traits from a face-to-face monologue. Specifically, we propose an explicit many2many (many-to-many) interactive approach to help recognize speaker personality traits efficiently. On the one hand, we encode the long speaking feature sequence of each modality with a bidirectional LSTM network. On the other hand, we design an explicit many2many attention mechanism to capture the interactions across multiple modalities over multiple interactive pairs. Empirical evaluation on 12 personality traits demonstrates the effectiveness of our proposed approach to multimodal speaker personality recognition.
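The following is a minimal sketch, in PyTorch, of how the described architecture might be organized: one bidirectional LSTM encoder per modality (textual, visual, acoustic), followed by attention over every ordered modality pair to realize the many-to-many interactions, and a final layer scoring the 12 traits. The module names, dimensions, pooling, and the scaled dot-product attention formulation are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Many2ManyPersonalityModel(nn.Module):
    """Illustrative sketch: per-modality BiLSTM encoders plus pairwise
    (many-to-many) cross-modal attention over text/visual/acoustic streams."""

    def __init__(self, dims, hidden=128, num_traits=12):
        super().__init__()
        # One bidirectional LSTM encoder per modality.
        self.encoders = nn.ModuleDict({
            m: nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
            for m, d in dims.items()
        })
        # Project the 2*hidden BiLSTM outputs to a shared dimension.
        self.proj = nn.ModuleDict({m: nn.Linear(2 * hidden, hidden) for m in dims})
        # 3 modalities -> 6 ordered interactive pairs (assumed fusion scheme).
        self.classifier = nn.Linear(hidden * 6, num_traits)

    def cross_attention(self, query, context):
        # Scaled dot-product attention: each `query` timestep attends over `context`.
        scores = torch.bmm(query, context.transpose(1, 2)) / context.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights, context)

    def forward(self, inputs):
        # inputs: {"text": (B, Tt, Dt), "visual": (B, Tv, Dv), "acoustic": (B, Ta, Da)}
        encoded = {}
        for m, x in inputs.items():
            h, _ = self.encoders[m](x)      # (B, T, 2*hidden)
            encoded[m] = self.proj[m](h)    # (B, T, hidden)

        # Many-to-many interactions: attend over every ordered modality pair.
        pooled = []
        mods = list(encoded)
        for q in mods:
            for c in mods:
                if q == c:
                    continue
                attended = self.cross_attention(encoded[q], encoded[c])
                pooled.append(attended.mean(dim=1))   # (B, hidden)

        fused = torch.cat(pooled, dim=-1)             # (B, hidden * 6)
        return self.classifier(fused)                 # one score per personality trait


# Usage with hypothetical feature sizes (e.g., GloVe text, visual, COVAREP acoustic):
model = Many2ManyPersonalityModel({"text": 300, "visual": 35, "acoustic": 74})
out = model({"text": torch.randn(2, 50, 300),
             "visual": torch.randn(2, 40, 35),
             "acoustic": torch.randn(2, 60, 74)})    # -> (2, 12)
```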
