Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification

Emotion classification is essential to understanding human interactions and is a vital component of behavioral studies as well as of the design of context-aware systems. Recent studies have shown that speech contains rich information about emotion, and numerous speech-based emotion classification methods have been proposed. However, classification performance still falls short of what is needed for these algorithms to be used in real systems. We present an emotion classification system that uses several one-against-all support vector machines with a thresholding fusion mechanism to combine the individual outputs. The fusion mechanism makes it possible to trade coverage for accuracy: classification accuracy increases at the expense of rejecting some samples as unclassified. Results show that the proposed system outperforms three state-of-the-art methods and that the thresholding fusion mechanism effectively improves emotion classification accuracy, which is important for applications that require very high accuracy but do not require that all samples be classified. We evaluate the system under several challenging scenarios, including speaker-independent tests, tests on noisy speech signals, and tests on non-professional acted recordings, to demonstrate the performance of the system and the effectiveness of the thresholding fusion mechanism in realistic conditions.
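To make the described pipeline concrete, the following is a minimal sketch of a one-against-all SVM ensemble with thresholding fusion, written in Python with scikit-learn. The emotion label set, the RBF kernel, the use of Platt-scaled probabilities as per-classifier confidence scores, and the threshold value are all illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of one-against-all SVMs with thresholding fusion.
# The label set, kernel, confidence measure, and threshold below are
# illustrative assumptions, not the authors' implementation.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # assumed label set
THRESHOLD = 0.6  # assumed confidence threshold; tune on held-out data

def train_ova_svms(X, y):
    """Train one binary SVM per emotion (one-against-all)."""
    classifiers = {}
    for emotion in EMOTIONS:
        binary_labels = (np.asarray(y) == emotion).astype(int)
        clf = SVC(kernel="rbf", probability=True)  # Platt-scaled scores
        clf.fit(X, binary_labels)
        classifiers[emotion] = clf
    return classifiers

def classify_with_fusion(classifiers, x):
    """Fuse per-class confidences; reject the sample if the best is low."""
    scores = {emotion: clf.predict_proba(x.reshape(1, -1))[0, 1]
              for emotion, clf in classifiers.items()}
    best = max(scores, key=scores.get)
    if scores[best] < THRESHOLD:
        return None  # sample left unclassified (rejected)
    return best
```

Raising THRESHOLD rejects more samples as unclassified in exchange for higher accuracy on the samples that are labeled, which is the coverage-for-accuracy trade-off the abstract describes; lowering it classifies every sample at the cost of more errors.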
