Efficient Recognition of Human Emotional States from Audio Signals

Automatic recognition of human emotional states is an important task for efficient human-machine communication. Most existing works focus on recognizing emotional states from audio signals alone, visual signals alone, or both. Here we propose empirical methods for feature extraction and classifier optimization that take the temporal characteristics of audio signals into account, and we introduce a framework for efficiently recognizing human emotional states from audio signals. The framework classifies input audio clips described by representative low-level features. In the experiments, seven discrete emotional states (anger, fear, boredom, disgust, happiness, sadness, and neutral) from the EmoDB dataset are recognized using nineteen audio features (15 standalone, 4 joint) and a Support Vector Machine (SVM) classifier. Extensive experiments demonstrate the effect of the feature extraction and classifier optimization methods on recognition accuracy. Our results show that the feature extraction and classifier optimization procedures improve emotion recognition accuracy by more than 11 percentage points: the overall accuracy achieved for the seven emotions in the EmoDB dataset is 83.33%, compared to a baseline of 72.22%.

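The paper does not include implementation code; the sketch below only illustrates the kind of classifier-optimization step described above, namely a cross-validated grid search over SVM kernel parameters, assuming the low-level audio features have already been extracted into a per-clip feature matrix. The parameter ranges, function names, and placeholder data are illustrative assumptions, not the original study's setup; scikit-learn's SVC (which wraps LIBSVM) stands in for the SVM used in the paper.

```python
# Minimal sketch of SVM classifier optimization for emotion recognition.
# Assumes a feature matrix X (one row of low-level audio features per clip)
# and a label vector y (7 emotion classes) have already been extracted;
# the parameter grid and placeholder data below are illustrative only.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMOTIONS = ["anger", "fear", "boredom", "disgust", "happiness", "sadness", "neutral"]

def optimize_and_evaluate(X, y, test_size=0.2, random_state=0):
    """Grid-search an RBF-kernel SVM and report held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state)

    pipeline = Pipeline([
        ("scale", StandardScaler()),   # normalize each feature dimension
        ("svm", SVC(kernel="rbf")),    # scikit-learn's SVC wraps LIBSVM
    ])
    param_grid = {                     # hypothetical search ranges
        "svm__C": [0.1, 1, 10, 100],
        "svm__gamma": ["scale", 0.01, 0.001],
    }
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)

    accuracy = search.score(X_test, y_test)
    return search.best_params_, accuracy

if __name__ == "__main__":
    # Placeholder random data standing in for extracted audio features.
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(210, 19))              # e.g. 19-dimensional feature vectors
    y_demo = rng.integers(0, len(EMOTIONS), size=210)
    best_params, acc = optimize_and_evaluate(X_demo, y_demo)
    print("best params:", best_params, "accuracy:", acc)
```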