Prosodic, Spectral and Voice Quality Feature Selection Using a Long-Term Stopping Criterion for Audio-Based Emotion Recognition

Emotion recognition from speech is an important field of research in human-machine interfaces and has begun to influence everyday life through applications such as call centers or wearable companions in the form of smartphones. In the proposed classification architecture, spectral, prosodic, and the comparatively novel voice quality features are extracted from the speech signal. These features are then aggregated to represent long-term information about the speech, yielding utterance-wise suprasegmental features. The most promising of these features are selected using a forward-selection/backward-elimination algorithm with a novel long-term termination criterion. The overall system has been evaluated on recordings from the public Berlin emotion database. Using the selected features, a recognition rate of 88.97% is achieved, which surpasses human performance on this database and is comparable to the state of the art on this dataset.
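For illustration, the following is a minimal sketch of such a wrapper-style forward-selection/backward-elimination search with a long-term (patience-based) stopping criterion: rather than terminating at the first iteration without improvement, the search continues for a fixed number of further iterations and returns the best subset seen over the whole run. The `evaluate` function (e.g., a cross-validated recognition rate on the utterance-level features), the `patience` parameter, and all identifiers are illustrative assumptions, not the authors' implementation.

```python
def evaluate(subset):
    """Placeholder (assumption): return a cross-validated recognition
    rate for the given feature subset (higher is better)."""
    raise NotImplementedError

def select_features(all_features, patience=10):
    selected = []                        # current working subset
    best_subset, best_score = [], float("-inf")
    stale = 0                            # iterations since the last new best

    while stale < patience:
        # Forward step: add the single feature that helps most.
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        f_best = max(candidates, key=lambda f: evaluate(selected + [f]))
        selected.append(f_best)

        # Backward step: drop any feature whose removal improves the score.
        improved = True
        while improved and len(selected) > 1:
            improved = False
            for f in list(selected):
                reduced = [g for g in selected if g != f]
                if evaluate(reduced) > evaluate(selected):
                    selected = reduced
                    improved = True

        # Long-term criterion: keep searching past local dips and only
        # stop after `patience` iterations without a new overall best.
        score = evaluate(selected)
        if score > best_score:
            best_subset, best_score = list(selected), score
            stale = 0
        else:
            stale += 1

    return best_subset, best_score
```

A conventional wrapper search would stop at the first non-improving iteration; the patience mechanism above is one plausible reading of a "long-term" criterion, trading extra evaluations for robustness against local optima in the subset search.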
