Applied to Human Centered Interaction Analysis

Emotional speech characterization is an important issue for the understanding of interaction. This article discusses the time-scale analysis problem in feature extraction for emotional speech processing. We describe a computational framework for combining segmental and supra-segmental features for emotional speech detection. The statistical fusion is based on the estimation of local a posteriori class probabilities, and the overall decision employs weighting factors directly related to the duration of the individual speech segments. This strategy is applied to a real-world application: the detection of Italian motherese in authentic, longitudinal parent-infant interactions recorded at home. The results suggest that short- and long-term information, represented respectively by the short-term spectrum and the prosodic parameters (fundamental frequency and energy), provides a robust and efficient time-scale analysis. A similar fusion methodology is also investigated through a phoneme-specific characterization process. This strategy is motivated by the fact that emotional states vary at the phoneme level. A time scale based on both vowels and consonants is proposed, and it provides a relevant and discriminative feature space for acted emotion recognition. Experimental results on two different databases, Berlin (German) and Aholab (Basque), show that the best performance is obtained by our phoneme-dependent approach. These findings demonstrate the relevance of taking phoneme dependency (vowels/consonants) into account for emotional speech characterization.
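To make the fusion step concrete, the sketch below shows one plausible reading of a duration-weighted combination of local a posteriori class probabilities: each segment's posterior vector is weighted by the segment's duration before the utterance-level decision is taken. This is a minimal illustration only, not the authors' implementation; the function name `fuse_segment_posteriors`, the two-class motherese example, and the specific normalization are assumptions introduced for clarity.

```python
# Minimal sketch (assumed, not the paper's code): duration-weighted fusion
# of per-segment class posteriors into a single utterance-level decision.
import numpy as np

def fuse_segment_posteriors(posteriors, durations):
    """Fuse local class posteriors using segment durations as weights.

    posteriors : (n_segments, n_classes) array; each row is a local
                 P(class | segment) estimate from a segmental or
                 supra-segmental classifier.
    durations  : (n_segments,) array of segment lengths in seconds.
    Returns the winning class index and the fused posterior vector.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    durations = np.asarray(durations, dtype=float)

    # Longer segments contribute more to the overall decision.
    weights = durations / durations.sum()
    fused = weights @ posteriors      # shape: (n_classes,)
    fused /= fused.sum()              # renormalize to a probability vector

    return int(np.argmax(fused)), fused

# Hypothetical example: three segments scored for two classes
# (motherese vs. other adult-directed speech).
local_probs = [[0.7, 0.3],   # 0.8 s segment
               [0.4, 0.6],   # 0.3 s segment
               [0.6, 0.4]]   # 1.2 s segment
label, fused = fuse_segment_posteriors(local_probs, [0.8, 0.3, 1.2])
print(label, fused)
```

Under this reading, brief segments with uncertain local estimates cannot dominate the decision, which is consistent with the abstract's claim that the weighting factors are tied directly to segment duration.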
