Emotion recognition from speech signals using new harmony features

In this paper we propose a new set of harmony features for automatic emotion recognition from speech signals. They are based on the psychoacoustic perception of harmony known from music theory. Starting from the estimated pitch contour of an utterance, we calculate the circular autocorrelation of the pitch histogram on the logarithmic semitone scale. This autocorrelation measures how often different two-pitch intervals occur, each of which causes a consonant or dissonant impression. Emotion recognition experiments that use these harmony features in addition to state-of-the-art features show improved recognition performance.
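The feature computation described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the pitch contour is given in Hz with unvoiced frames marked as 0, that pitches are folded into 12 semitone classes per octave (which is what makes the autocorrelation circular), and that 440 Hz is the reference frequency. The function names `semitone_histogram` and `circular_autocorrelation` and all parameters are hypothetical.

```python
import math


def semitone_histogram(pitch_hz, ref_hz=440.0, bins=12):
    """Normalized histogram of pitch classes on a logarithmic semitone scale.

    Assumption: values <= 0 mark unvoiced frames and are skipped.
    """
    hist = [0.0] * bins
    for f in pitch_hz:
        if f <= 0:
            continue
        # Distance from the reference in semitones (log scale), folded into one octave.
        semitones = bins * math.log2(f / ref_hz)
        hist[round(semitones) % bins] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def circular_autocorrelation(hist):
    """r[k] = sum_i hist[i] * hist[(i + k) mod N]; k is the interval in semitones."""
    n = len(hist)
    return [sum(hist[i] * hist[(i + k) % n] for i in range(n)) for k in range(n)]


# Toy contour alternating between 220 Hz and 330 Hz (a perfect fifth, 7 semitones).
contour = [220.0, 0.0, 330.0, 220.0, 330.0]
hist = semitone_histogram(contour)
r = circular_autocorrelation(hist)
```

In this toy example the autocorrelation has its off-zero peaks at lags 7 and 5 (a fifth and its octave complement, a fourth), which is the kind of interval evidence the harmony features are meant to capture.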
