Determining the Smallest Emotional Unit for Level of Arousal Classification

Most state-of-the-art emotion recognition methods are based on turn- and frame-level analysis, independent of the phonetic transcription. So far, the affective computing community has not identified a smallest emotional unit that can be reliably classified and recognized by both advanced and non-advanced listeners. It is well known that acoustic modeling on the smallest phonetic unit, the phoneme, started a new era in automatic speech recognition: the switch from speaker-dependent isolated-word recognition to speaker-independent continuous speech recognition. In our current research we show that the phoneme can also serve as the smallest unit for classifying high- versus low-arousal emotions. We trained our classification models on material from the VAM dataset and evaluated them on speech samples from the DES dataset. For our experiments we employed two different emotion classification approaches: a general (phonetic-pattern-independent) one and a phoneme-based (phonetic-pattern-dependent) one. Both approaches use MFCC features extracted at the frame level. Our experimental results show that the proposed phoneme-based classification technique increases emotion classification performance by about 9.68% absolute (15.98% relative). We also show that phoneme-level emotion models trained on "natural" emotions provide strong classification performance on a dataset with acted affective content.
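The following is a minimal sketch, not the authors' exact pipeline, contrasting the two approaches described above: a general (phonetic-pattern-independent) arousal classifier that models all frame-level MFCCs of a class with a single GMM, and a phoneme-based (phonetic-pattern-dependent) variant that trains a separate GMM pair per phoneme and scores each frame with the models of its aligned phoneme. The toy data, phoneme inventory, GMM hyperparameters, and the assumption that a forced alignment is available are all illustrative choices, not details from the paper; real features would come from an MFCC front end (e.g. librosa) applied to the VAM and DES recordings.

```python
# Sketch: general vs. phoneme-based high/low arousal classification on MFCC frames.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
N_MFCC = 13                           # frame-level MFCC dimensionality (assumed)
PHONEMES = ["a", "e", "i", "o", "u"]  # toy phoneme inventory (assumed)

def toy_frames(n, shift):
    """Stand-in for MFCC frames of one arousal class (synthetic data)."""
    return rng.normal(loc=shift, scale=1.0, size=(n, N_MFCC))

# --- General approach: one GMM per arousal class over all frames -----------
train_high = toy_frames(2000, shift=+0.5)
train_low  = toy_frames(2000, shift=-0.5)
gmm_high = GaussianMixture(n_components=8, random_state=0).fit(train_high)
gmm_low  = GaussianMixture(n_components=8, random_state=0).fit(train_low)

def classify_general(frames):
    """Sum frame log-likelihoods under each class model; pick the larger."""
    return "high" if gmm_high.score_samples(frames).sum() > \
                     gmm_low.score_samples(frames).sum() else "low"

# --- Phoneme-based approach: one GMM pair (high/low) per phoneme -----------
phoneme_models = {}
for ph in PHONEMES:
    # In a real setup these frames would come from a forced alignment of the
    # training speech; here they are synthetic per-phoneme subsets.
    phoneme_models[ph] = (
        GaussianMixture(n_components=4, random_state=0).fit(toy_frames(500, +0.5)),
        GaussianMixture(n_components=4, random_state=0).fit(toy_frames(500, -0.5)),
    )

def classify_phoneme_based(frames, alignment):
    """`alignment` gives the phoneme label of each frame (e.g. from an aligner)."""
    ll_high = ll_low = 0.0
    for frame, ph in zip(frames, alignment):
        m_high, m_low = phoneme_models[ph]
        ll_high += m_high.score_samples(frame[None, :])[0]
        ll_low  += m_low.score_samples(frame[None, :])[0]
    return "high" if ll_high > ll_low else "low"

# Toy utterance: 50 frames of "high arousal"-like features with a fake alignment.
utt = toy_frames(50, shift=+0.5)
align = rng.choice(PHONEMES, size=len(utt))
print(classify_general(utt), classify_phoneme_based(utt, align))
```

The design point illustrated here is that the phoneme-based variant conditions the arousal decision on phonetic content, so each frame is compared only against models trained on acoustically comparable material; whether this yields the reported gain depends on the quality of the phoneme alignment and on per-phoneme training data coverage.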
