Recognising interest in conversational speech - comparing bag of frames and supra-segmental features

It is common knowledge that affective and emotion-related states are acoustically well modelled on a supra-segmental level. Nonetheless, successes have also been reported for frame-level processing, either by means of dynamic classification or by multi-instance learning techniques. In this work, a quantitative, feature-type-wise comparison between frame-level and supra-segmental analysis is carried out for the recognition of interest in human conversational speech. To shed light on the respective differences, the same classifier, namely Support Vector Machines, is used in both cases: once by clustering a 'bag of frames' of unknown sequence length using multi-instance learning techniques, and once by applying statistical functionals to project the time series onto a static feature vector. The Audiovisual Interest Corpus of naturalistic interest serves as the database.

Index Terms: interest recognition, multi-instance learning, feature relevance
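To make the two analysis routes concrete, here is a minimal Python sketch of the supra-segmental route: frame-level low-level descriptors are projected onto a fixed-length vector by applying statistical functionals, independently of how many frames the utterance contains. The functional set chosen here (mean, standard deviation, extrema, median) is illustrative only and not necessarily the paper's exact configuration.

```python
import numpy as np

def apply_functionals(frames: np.ndarray) -> np.ndarray:
    """Project a (num_frames, num_llds) time series onto a static vector."""
    functionals = [np.mean, np.std, np.min, np.max,
                   lambda x, axis: np.percentile(x, 50, axis=axis)]  # median
    return np.concatenate([f(frames, axis=0) for f in functionals])

# Example: an utterance of 120 frames with 4 low-level descriptors
# (e.g. pitch, energy, two MFCCs) always yields a 20-dimensional vector.
utterance = np.random.randn(120, 4)
static_vector = apply_functionals(utterance)  # shape: (20,)
```

The frame-level route, viewed as multiple-instance learning, treats each utterance as a 'bag' of frame instances that inherit the bag's label. The rough sketch below, assuming binary bag labels, trains a standard SVM on the instances and aggregates frame-level decisions with a simple max rule; this stands in for a dedicated MIL-SVM formulation and is not the paper's exact method.

```python
import numpy as np
from sklearn.svm import SVC

def train_bag_of_frames(bags, bag_labels):
    # Every frame in a bag provisionally carries the bag's label (0 or 1).
    X = np.vstack(bags)
    y = np.concatenate([[label] * len(bag)
                        for label, bag in zip(bag_labels, bags)])
    return SVC(kernel="rbf").fit(X, y)

def predict_bag(clf, bag):
    # Max rule: a single confidently positive frame flags the whole bag.
    return int(clf.decision_function(bag).max() > 0)
```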
