Multi-classifier systems and forward-backward feature selection algorithms to classify emotionally coloured speech

Systems for the recognition of psychological characteristics such as the emotional state in real-world scenarios have to deal with several difficulties, among them unconstrained environments and uncertainties in one or several input channels. An even more crucial aspect, however, is the content of the data itself. Psychological states are highly person-dependent, and often even humans are not able to determine the correct state a person is in. A successful recognition system therefore has to deal with data that is not very discriminative and often simply misleading. To succeed, a critical view on features and decisions is essential in order to select only the most valuable ones. This work presents a comparison of a common multi-classifier system approach based on state-of-the-art features with a modified forward-backward feature selection algorithm that uses a long-term stopping criterion. The second approach also takes features of the voice quality family into account. Both approaches are based on the audio modality only. The dataset used in the challenge lies between real-world datasets, which are still very hard to handle, and over-acted datasets, which were popular in the past and are well understood today.
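
The paper itself gives no code; the following is a minimal sketch of what a forward-backward (floating) feature selection with a long-term stopping criterion might look like. The SVC base classifier, the cross-validation scoring, and the `patience` parameter are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: greedy forward-backward feature selection with a patience-based
# ("long term") stopping rule. Base classifier and scoring are assumed.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def forward_backward_selection(X, y, patience=10, cv=5):
    """Greedy forward-backward search over feature indices.

    Instead of halting at the first score plateau, the search continues
    for `patience` consecutive unproductive iterations (the "long term"
    criterion) and returns the best subset seen so far.
    """
    n_features = X.shape[1]
    selected, best_subset = [], []
    best_score, stale = -np.inf, 0

    def score(subset):
        if not subset:
            return -np.inf
        clf = SVC(kernel="rbf")  # assumed base classifier
        return cross_val_score(clf, X[:, subset], y, cv=cv).mean()

    while stale < patience:
        # Forward step: add the single feature that helps most.
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        fwd_score, fwd_feat = max((score(selected + [f]), f) for f in candidates)
        selected.append(fwd_feat)

        # Backward (floating) step: drop earlier choices while that helps,
        # but keep the feature that was just added to guarantee progress.
        improved = True
        while improved and len(selected) > 1:
            improved = False
            for f in list(selected):
                if f == fwd_feat:
                    continue
                reduced = [g for g in selected if g != f]
                if score(reduced) > score(selected):
                    selected = reduced
                    improved = True
                    break

        current = score(selected)
        if current > best_score:
            best_score, best_subset, stale = current, list(selected), 0
        else:
            stale += 1

    return best_subset, best_score


if __name__ == "__main__":
    # Tiny synthetic demo; the data here is purely illustrative.
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=200, n_features=20,
                               n_informative=5, random_state=0)
    subset, acc = forward_backward_selection(X, y, patience=3)
    print("selected features:", sorted(subset), "cv accuracy: %.3f" % acc)
```

The floating backward step lets the search undo earlier greedy choices, and the patience counter captures the long-term idea: rather than stopping at the first plateau, the search tolerates a fixed number of unproductive iterations before returning the best subset encountered.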
