Investigation of Speaker Group-Dependent Modelling for Recognition of Affective States from Speech

For successful human–computer interaction (HCI), not only the pure textual information but also the individual skills, preferences, and affective states of the user must be known. As a starting point, therefore, the user's current affective state has to be recognised. In this work we investigated how additional knowledge about the user, such as age and gender, can be used to improve the recognition of affective states. Two methods from automatic speech recognition are used to incorporate age and gender differences into the recognition of affective states: speaker group-dependent (SGD) modelling and vocal tract length normalisation (VTLN). The investigations were performed on four corpora with acted and naturalistic affective speech. Different feature sets and two classification methods, Gaussian mixture models (GMMs) and multi-layer perceptrons (MLPs), were used. In addition, the effects of channel compensation and contextual characteristics were analysed. The results are compared with our own baseline results and with results reported in the literature. Two hypotheses were tested. First, incorporating age information in addition to gender further improves speaker group-dependent modelling. Second, acoustic normalisation does not achieve the same improvement as speaker group-dependent modelling, because a speaker's age and gender affect the way emotions are expressed.
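To make the speaker group-dependent modelling idea concrete, the following is a minimal Python sketch, not the authors' actual pipeline: one GMM is trained per (speaker group, emotion) pair, and a test utterance is scored only against the models of its own age/gender group. The group names, emotion set, feature dimensionality, and the helper names (`train_sgd_models`, `classify`) are illustrative assumptions, and scikit-learn stands in for whatever toolkit was actually used.

```python
# Hedged sketch of speaker group-dependent (SGD) emotion classification.
# Assumes precomputed per-frame acoustic features (e.g. MFCC-like vectors);
# group and emotion labels below are placeholders, not the corpora's labels.
import numpy as np
from sklearn.mixture import GaussianMixture

GROUPS = ("female", "male")            # could be refined to age x gender groups
EMOTIONS = ("anger", "happiness", "neutral", "sadness")

def train_sgd_models(train_data, n_components=8):
    """train_data maps (group, emotion) -> (N, D) feature matrix.
    One GMM is fitted per pair, i.e. per speaker group and class."""
    models = {}
    for key, X in train_data.items():
        models[key] = GaussianMixture(
            n_components=n_components, covariance_type="diag", random_state=0
        ).fit(X)
    return models

def classify(models, group, X_utt):
    """Score the utterance only against its own group's emotion models and
    return the class with the highest average frame log-likelihood."""
    return max(EMOTIONS, key=lambda e: models[(group, e)].score(X_utt))

# Toy usage with random 13-dimensional "features":
rng = np.random.default_rng(0)
data = {(g, e): rng.normal(size=(200, 13)) for g in GROUPS for e in EMOTIONS}
models = train_sgd_models(data)
print(classify(models, "female", rng.normal(size=(50, 13))))
```

The design point is that each group's models only have to capture the emotion-related variability within that group, so systematic age- and gender-related acoustic differences no longer blur the class boundaries.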
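For contrast, the alternative tested in the paper, vocal tract length normalisation, tries to remove speaker differences at the feature level instead of modelling them. The sketch below shows one common maximum-likelihood warp-selection scheme under simplifying assumptions: a plain linear warp of the spectral axis, MFCC recomputation per candidate warp factor, and selection of the factor under which a speaker-independent reference GMM scores the utterance highest. The function names (`vtln_mfcc`, `best_warp`), the warp grid, and the use of librosa are illustrative choices, not the paper's implementation.

```python
# Hedged sketch of maximum-likelihood VTLN warp selection with a simple
# linear warp f -> alpha * f, applied by resampling each power-spectrum
# frame before the mel filterbank. ref_gmm is assumed to be a
# speaker-independent GMM trained on unwarped MFCCs of the same dimension.
import numpy as np
import librosa
from scipy.fft import dct

def vtln_mfcc(y, sr, alpha, n_fft=512, n_mels=26, n_mfcc=13):
    """MFCCs after linearly warping the spectral frequency axis by alpha."""
    S = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2          # (1 + n_fft/2, T)
    freqs = np.linspace(0.0, sr / 2, S.shape[0])
    warped = np.empty_like(S)
    for t in range(S.shape[1]):
        # sample the original spectrum at the warped frequencies alpha * f
        warped[:, t] = np.interp(alpha * freqs, freqs, S[:, t])
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    logmel = np.log(mel_fb @ warped + 1e-10)
    return dct(logmel, axis=0, norm="ortho")[:n_mfcc]       # (n_mfcc, T)

def best_warp(y, sr, ref_gmm, alphas=np.arange(0.88, 1.13, 0.02)):
    """Keep the warp factor under which the reference model assigns the
    utterance the highest average log-likelihood."""
    return max(alphas, key=lambda a: ref_gmm.score(vtln_mfcc(y, sr, a).T))
```

Under the paper's second hypothesis, such purely acoustic normalisation should fall short of group-dependent modelling, since it compensates vocal tract geometry but not age- and gender-specific ways of expressing emotion.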
