Continuous emotion recognition with phonetic syllables

As research on extracting acoustic properties of speech for emotion recognition progresses, it becomes increasingly important to investigate feature extraction methods that meet the requirements of real-time processing systems. Past work has shown the importance of syllables for the transmission of emotions, while classical research methods in prosody show that studying intonation phenomena requires concentrating on specific areas of the speech signal. Technological approaches, however, are often designed to use the whole speech signal without taking into account the qualitative variability of its spectral content. Given this contrast with the theoretical basis of prosodic research, we present a feature extraction method built on a phonetic interpretation of the concept of syllable. In particular, we concentrate on the spectral content of syllabic nuclei, thus reducing the amount of information to be processed. Moreover, we introduce feature weighting based on syllabic prominence, so that the units of analysis are not all treated as equally important. The method is evaluated on a continuous, three-dimensional model of emotions built on the classical axes of Valence, Activation and Dominance, and is shown to be competitive with state-of-the-art performance. The potential impact of this approach on the design of affective computing systems is also analysed.
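
As a rough illustration of what prominence-based feature weighting could look like, the sketch below pools per-nucleus feature vectors into one summary vector using a prominence-weighted mean. The abstract does not specify the weighting scheme, so the weighted mean, the function name `prominence_weighted_mean`, and the example values are all assumptions made for illustration rather than the method evaluated in the paper.

```python
from typing import List, Sequence


def prominence_weighted_mean(
    nucleus_features: Sequence[Sequence[float]],
    prominence: Sequence[float],
) -> List[float]:
    """Average per-nucleus feature vectors, weighting each by its prominence score.

    Assumption: prominence scores are non-negative and act as linear weights.
    """
    if len(nucleus_features) != len(prominence):
        raise ValueError("one prominence score is needed per syllabic nucleus")
    total = sum(prominence)
    if total <= 0:
        raise ValueError("prominence scores must sum to a positive value")
    dims = len(nucleus_features[0])
    # Weighted mean, computed independently for each feature dimension.
    return [
        sum(w * vec[d] for vec, w in zip(nucleus_features, prominence)) / total
        for d in range(dims)
    ]


# Hypothetical values: three nuclei, two features each.
features = [[1.0, 100.0], [3.0, 200.0], [2.0, 150.0]]
weights = [1.0, 3.0, 0.0]  # the third nucleus is treated as non-prominent
summary = prominence_weighted_mean(features, weights)
```

With a weight of zero, the third nucleus contributes nothing to the summary, which mirrors the idea that not all units of analysis are equally important.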
