Automatic speech emotion detection system using multi-domain acoustic feature selection and classification models

Emotions exhibited by a speaker can be detected by analyzing the speaker's speech, facial expressions, and gestures, or by combining these modalities. This paper concentrates on determining the emotional state from speech signals. Various acoustic features such as energy, zero crossing rate (ZCR), fundamental frequency, and Mel Frequency Cepstral Coefficients (MFCCs) are extracted from short-term, overlapping frames derived from the speech signal. A feature vector for every utterance is then constructed by computing global statistics (mean, median, etc.) of the extracted features over all frames. To select a subset of useful features from the full candidate feature vector, the sequential backward selection (SBS) method is used with k-fold cross-validation. Emotion in a sample is detected by classifying its feature vector with either a pre-trained Support Vector Machine (SVM) model or a Linear Discriminant Analysis (LDA) classifier. This approach is tested on two acted emotional databases: the Berlin Database of Emotional Speech (EmoDB) and the BML Emotion Database (RED). For multi-class classification, accuracies of 80% on EmoDB and 73% on RED are achieved, which are higher than or comparable to previously reported results on both databases.
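
Below is a minimal sketch of the frame-level feature extraction and global-statistics pooling stage described above, assuming Python with numpy and librosa. The 25 ms/10 ms framing, the 13 MFCCs, and the 60-400 Hz pitch search range are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np
import librosa

def pooled(x):
    """Global statistics (mean, std, median, min, max) over all frames."""
    x = np.atleast_2d(x)  # shape (n_dims, n_frames)
    return np.concatenate([x.mean(axis=1), x.std(axis=1),
                           np.median(x, axis=1), x.min(axis=1), x.max(axis=1)])

def utterance_vector(path, sr=16000):
    """One fixed-length feature vector per utterance."""
    y, _ = librosa.load(path, sr=sr)
    frame, hop = int(0.025 * sr), int(0.010 * sr)  # 25 ms frames, 10 ms hop
    energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)  # fundamental frequency track
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    # Each stream is pooled independently, so their frame counts need not match.
    return np.concatenate([pooled(f) for f in (energy, zcr, f0, mfcc)])
```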
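
The SBS stage can be approximated with scikit-learn's SequentialFeatureSelector run in backward mode under k-fold cross-validation. The wrapped RBF-kernel SVM scorer and the choice of 30 retained features are assumptions for illustration, not values fixed by the abstract.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def select_features(X, y, n_keep=30, k=5):
    """Sequential backward selection scored by k-fold cross-validation.

    X: (n_utterances, n_features) matrix of pooled vectors; y: emotion labels.
    """
    sbs = SequentialFeatureSelector(
        make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        n_features_to_select=n_keep,
        direction="backward",  # start from the full set and prune
        cv=k,
    )
    sbs.fit(X, y)
    return sbs.get_support(indices=True)  # indices of the retained features
```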
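
Finally, the two classifiers named in the abstract can be compared on the selected features, again sketched with scikit-learn; the RBF kernel and 5 folds are assumed rather than taken from the paper.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(X_sel, y, k=5):
    """k-fold cross-validated accuracy for the SVM and LDA classifiers."""
    for name, clf in [("SVM", make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
                      ("LDA", LinearDiscriminantAnalysis())]:
        scores = cross_val_score(clf, X_sel, y, cv=k)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```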
