Time-Scale Feature Extractions for Emotional Speech Characterization

Emotional speech characterization is an important issue for understanding interaction. This article discusses the time-scale analysis problem in feature extraction for emotional speech processing. We describe a computational framework for combining segmental and supra-segmental features for emotional speech detection. The statistical fusion is based on the estimation of local a posteriori class probabilities, and the overall decision employs weighting factors directly related to the duration of the individual speech segments. This strategy is applied to a real-world task: the detection of Italian motherese in authentic, longitudinal parent–infant interaction recorded at home. The results suggest that short- and long-term information, represented respectively by the short-term spectrum and by prosodic parameters (fundamental frequency and energy), provides a robust and efficient time-scale analysis. A similar fusion methodology is also investigated through a phoneme-specific characterization process, motivated by the fact that emotional states vary at the phoneme level. A time-scale based on both vowels and consonants is proposed; it yields a relevant and discriminant feature space for acted emotion recognition. Experimental results on two databases, Berlin (German) and Aholab (Basque), show that the best performance is obtained by our phoneme-dependent approach. These findings demonstrate the relevance of taking phoneme dependency (vowels/consonants) into account for emotional speech characterization.
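
As a toy illustration of the fusion rule described above, the sketch below combines segment-level a posteriori class probabilities into a single utterance-level decision, weighting each segment by its duration. It assumes local posteriors are already available from some segment-level classifier; the `Segment` container, the class labels, and the function name are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    duration: float               # segment length in seconds
    posteriors: dict[str, float]  # local P(class | segment features)

def fuse_segments(segments: list[Segment]) -> str:
    """Combine local posteriors into one utterance-level decision,
    weighting each segment's vote by its relative duration."""
    total = sum(s.duration for s in segments)
    scores: dict[str, float] = {}
    for s in segments:
        w = s.duration / total  # weight proportional to segment duration
        for label, p in s.posteriors.items():
            scores[label] = scores.get(label, 0.0) + w * p
    return max(scores, key=scores.get)

# Usage: two short segments favour "motherese", but the single long
# neutral segment carries more duration weight and wins the decision.
segs = [
    Segment(0.4, {"motherese": 0.9, "other": 0.1}),
    Segment(0.3, {"motherese": 0.8, "other": 0.2}),
    Segment(2.0, {"motherese": 0.3, "other": 0.7}),
]
print(fuse_segments(segs))  # -> "other"
```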

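The phoneme-dependent time-scale can be sketched in the same spirit: statistics are pooled separately over vowel and consonant segments and then concatenated into one feature vector, so that vowel- and consonant-specific variation stays separable for the classifier. This is a minimal sketch assuming a forced alignment that provides phone labels with F0 samples; the vowel inventory and the F0-only feature choice are illustrative assumptions, not the paper's exact feature set.

```python
import statistics

VOWELS = set("aeiou")  # simplistic vowel inventory, for the sketch only

def phoneme_class_features(phones: list[tuple[str, list[float]]]) -> list[float]:
    """phones: (phone_label, f0_samples) pairs from a forced alignment.
    Returns [vowel_f0_mean, vowel_f0_std, cons_f0_mean, cons_f0_std]."""
    vowel_f0 = [f for p, f0 in phones if p in VOWELS for f in f0]
    cons_f0 = [f for p, f0 in phones if p not in VOWELS for f in f0]
    feats: list[float] = []
    for pool in (vowel_f0, cons_f0):
        # Pool statistics per phone class; zeros if a class is absent.
        feats.append(statistics.fmean(pool) if pool else 0.0)
        feats.append(statistics.pstdev(pool) if len(pool) > 1 else 0.0)
    return feats

# Usage with a toy alignment of "ba-bi":
aligned = [("b", [110.0, 112.0]), ("a", [180.0, 190.0, 185.0]),
           ("b", [108.0]), ("i", [200.0, 210.0])]
print(phoneme_class_features(aligned))
```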