Balancing spoken content adaptation and unit length in the recognition of emotion and interest

Recognition and detection of non-lexical or paralinguistic cues from speech typically relies on one general model per event (emotional state, level of interest). Commonly, this model is trained independently of the phonetic structure. Given sufficient data, this approach seemingly works well enough. Yet this paper addresses the question of the phonetic level at which emotion and level of interest first become apparent. We therefore compare phoneme-, word-, and sentence-level analysis for emotional sentence classification, using a large prosodic, spectral, and voice-quality feature space for SVM classification and MFCC features for HMM/GMM modeling. The experiments also take the need for ASR into account when selecting appropriate unit models. On the well-known public EMO-DB database and the SUSAS and AVIC spontaneous-interest corpora, we found that sentence-level analysis yields the best emotion recognition results. We discuss the implications of these types of analysis for the design of robust emotion and interest recognition in usable human-machine interfaces (HMI).
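To make the sentence-level setup concrete, here is a minimal sketch of the general idea: a variable-length utterance is mapped to a fixed-length vector by applying statistical functionals to frame-wise acoustic features, which an SVM then classifies. This is an illustration only, assuming librosa and scikit-learn; the paper's actual system uses a much larger prosodic, spectral, and voice-quality feature set, and MFCC functionals stand in here for brevity.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def sentence_level_features(y, sr):
    """Map a variable-length utterance to a fixed-length vector by
    applying functionals (mean, std) to frame-wise MFCCs."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Synthetic stand-in data (no real corpus): two pseudo-classes of
# random one-second "utterances" at 16 kHz.
rng = np.random.default_rng(0)
sr = 16000
X = np.array([sentence_level_features(rng.standard_normal(sr) * (1 + label), sr)
              for label in (0, 1) for _ in range(10)])
y_labels = np.repeat([0, 1], 10)

# Standardize features, then train an RBF-kernel SVM on the functionals.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y_labels)
print(clf.score(X, y_labels))
```

Phoneme- or word-level variants of this pipeline would compute the same functionals over ASR-derived unit boundaries instead of the whole utterance, which is why the paper factors the need for ASR into the choice of unit models.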
