Autonomous Emotion Learning in Speech: A View of Zero-Shot Speech Emotion Recognition

Conventionally, speech emotion recognition is achieved with passive learning approaches. In contrast to such approaches, we herein propose and develop a dynamic method of autonomous emotion learning based on zero-shot learning. The proposed methodology employs emotional dimensions as the attributes in the zero-shot learning paradigm, resulting in two phases of learning: attribute learning and label learning. Attribute learning connects paralinguistic features to the attributes using speech with known emotional labels, while label learning defines unseen emotions through those attributes. The experimental results achieved on the CINEMO corpus indicate that zero-shot learning is a useful technique for autonomous speech-based emotion learning, achieving accuracies considerably better than both chance level and an attribute-based gold-standard setup. Furthermore, the choice of emotion recognition task, emotional attributes, and employed approach strongly influences system performance.
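To make the two-phase pipeline concrete, the sketch below follows the spirit of attribute-based zero-shot classification (direct attribute prediction, Lampert et al.): a regressor maps paralinguistic features of seen emotions onto the attribute space, and an unseen emotion is then assigned by nearest attribute signature. This is a minimal illustration, not the paper's actual configuration; the arousal/valence signature values, the ridge regressor, and the 88-dimensional random features (an eGeMAPS-sized stand-in for openSMILE descriptors) are all assumptions introduced here.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical attribute signatures (arousal, valence) per emotion label.
seen_attributes = {"joy": (0.8, 0.9), "anger": (0.9, 0.1)}
unseen_attributes = {"fear": (0.7, 0.2), "sadness": (0.2, 0.1)}

# Phase 1 -- attribute learning: map paralinguistic features of *seen*
# emotions onto the emotional-dimension (attribute) space.
X_train = rng.normal(size=(200, 88))  # stand-in for openSMILE features
labels = rng.choice(list(seen_attributes), size=200)
A_train = np.array([seen_attributes[l] for l in labels])
attribute_model = Ridge(alpha=1.0).fit(X_train, A_train)

# Phase 2 -- label learning: define an *unseen* emotion as the label whose
# attribute signature lies closest to the predicted attributes.
def predict_unseen(x: np.ndarray) -> str:
    a_hat = attribute_model.predict(x.reshape(1, -1))[0]
    return min(unseen_attributes,
               key=lambda l: np.linalg.norm(a_hat - np.asarray(unseen_attributes[l])))

print(predict_unseen(rng.normal(size=88)))  # e.g. "fear" or "sadness"
```

Nearest-signature assignment is only one way to realise the label-learning phase; attribute classifiers with probabilistic combination or learned compatibility functions are common alternatives in the zero-shot literature.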
