Attention-Based Multimodal Fusion for Estimating Human Emotion in Real-World HRI

Toward empathetic and harmonious human-robot interaction (HRI), automatic estimation of human emotion has attracted increasing attention across research fields. In this report, we propose an attention-based multimodal fusion approach that explores the middle ground between traditional early and late fusion in order to handle asynchronous multimodal inputs while accounting for their relatedness. The proposed approach enables the robot to align the human's visual and speech signals (specifically, facial, acoustic, and lexical information) extracted by its cameras, microphones, and processing modules, and it is expected to achieve robust estimation performance in real-world HRI.
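
As a rough illustration of this idea, the sketch below (PyTorch) encodes each modality stream separately, applies temporal attention within each stream, and then combines the streams with a modality-level attention before classification. All module names, feature dimensions, and hyperparameters are illustrative assumptions for the sketch, not the paper's actual implementation.

```python
# Minimal sketch of attention-based fusion over asynchronous modality streams.
# Names and dimensions are illustrative; the paper's architecture may differ.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse facial, acoustic, and lexical sequences of different lengths by
    attending over each stream's time steps, then over the modalities."""
    def __init__(self, dims, hidden=128, num_classes=4):
        super().__init__()
        # One GRU encoder per modality (input dimensionality differs per stream).
        self.encoders = nn.ModuleList(
            [nn.GRU(d, hidden, batch_first=True) for d in dims]
        )
        # Learned query used to score each time step within a stream.
        self.query = nn.Parameter(torch.randn(hidden))
        # Modality-level attention decides how much each stream contributes.
        self.modality_score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, num_classes)

    def attend(self, h):
        # h: (batch, time, hidden); softmax over time gives temporal weights.
        scores = torch.softmax(h @ self.query, dim=1)         # (batch, time)
        return (scores.unsqueeze(-1) * h).sum(dim=1)          # (batch, hidden)

    def forward(self, streams):
        # streams: list of (batch, time_m, dim_m) tensors, one per modality,
        # with a different time_m per modality (asynchronous inputs).
        pooled = []
        for enc, x in zip(self.encoders, streams):
            h, _ = enc(x)                      # encode each stream separately
            pooled.append(self.attend(h))      # temporal attention pooling
        pooled = torch.stack(pooled, dim=1)    # (batch, n_modalities, hidden)
        w = torch.softmax(self.modality_score(pooled), dim=1)  # modality weights
        fused = (w * pooled).sum(dim=1)        # weighted fusion across modalities
        return self.classifier(fused)

# Example: facial (35-dim), acoustic (88-dim), and lexical (300-dim) features
# with different sequence lengths for the same utterance.
model = AttentionFusion(dims=[35, 88, 300])
face = torch.randn(2, 120, 35)    # e.g., per-frame facial features
audio = torch.randn(2, 400, 88)   # e.g., frame-level acoustic features
text = torch.randn(2, 15, 300)    # e.g., word embeddings
logits = model([face, audio, text])
print(logits.shape)               # torch.Size([2, 4])
```

Because attention pools each stream before fusion, the modalities never need to be frame-synchronized, which is the property the abstract highlights for asynchronous inputs.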
