Methods and challenges for creating an emotional audio-visual database

Emotion plays an important role in human communication and can be expressed verbally through speech (e.g. pitch, intonation, prosody) or non-verbally through facial expressions, gestures, and so on. Most contemporary human-computer interaction systems are deficient in interpreting this information and therefore lack emotional intelligence: they cannot identify a user's emotional state and consequently cannot react to it appropriately. To overcome this limitation, machines need to be trained on annotated emotional data samples. Motivated by this, we have attempted to collect and create an audio-visual emotional corpus. Audio-visual signals of multiple subjects were recorded while they watched either presentations (with background music) or emotional video clips. After the recording, subjects were asked to express how they felt and to read out sentences that appeared on the screen. The recorded data were annotated both by the subjects themselves (self-annotation) and by other annotators.
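
The combination of per-session recordings, self-reports, and external annotation described above suggests a simple record structure for the corpus. The sketch below is a hypothetical Python illustration of how such sessions and labels could be organised; the class names, fields, and label set are assumptions for illustration, not the schema actually used for this corpus.

```python
# Hypothetical sketch of one recorded session and its annotations.
# Field names and the emotion label set are illustrative assumptions,
# not the paper's actual annotation schema.
from dataclasses import dataclass, field
from typing import List

# Illustrative emotion labels; the paper does not fix a label set here.
EMOTION_LABELS = ["happy", "sad", "angry", "fearful", "neutral"]

@dataclass
class Annotation:
    annotator_id: str      # "self" for the subject's own report, else an external rater
    emotion: str           # one of EMOTION_LABELS
    is_self_report: bool

@dataclass
class RecordingSession:
    subject_id: str
    stimulus: str          # e.g. "presentation_with_music" or "emotional_video_clip"
    audio_path: str        # recorded audio track
    video_path: str        # recorded video track
    read_sentences: List[str] = field(default_factory=list)  # sentences read from the screen
    annotations: List[Annotation] = field(default_factory=list)

    def agreed_label(self) -> str:
        """Majority emotion label across self and external annotations."""
        votes = [a.emotion for a in self.annotations]
        return max(set(votes), key=votes.count) if votes else "unlabelled"

# Example: one session annotated by the subject and two external raters.
session = RecordingSession(
    subject_id="S01",
    stimulus="emotional_video_clip",
    audio_path="data/S01/clip03.wav",
    video_path="data/S01/clip03.mp4",
    annotations=[
        Annotation("self", "sad", True),
        Annotation("rater_1", "sad", False),
        Annotation("rater_2", "neutral", False),
    ],
)
print(session.agreed_label())  # -> "sad"
```

Keeping self-reports and external ratings in the same record makes it straightforward to compare them later, for example to measure how often outside annotators agree with the subject's own assessment.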
