Emotion Recognition Using Vocal Tract Information

This chapter discusses the emotion-specific information offered by vocal tract features. Well-known spectral features such as linear prediction cepstral coefficients (LPCCs) and mel-frequency cepstral coefficients (MFCCs) are used as correlates of vocal tract information for discriminating emotions. In addition to LPCCs and MFCCs, formant-related features are also explored in this work for recognizing emotions from speech. Extraction of the above-mentioned spectral features is discussed briefly. In this study, auto-associative neural network (AANN) models and Gaussian mixture models (GMMs) are used for classifying the emotions, and the functionality of both model types is briefly described. The emotion recognition performance obtained with the different vocal tract features is compared on the Indian and Berlin emotional speech databases. The performance of neural networks and Gaussian mixture models in classifying emotional utterances based on vocal tract features is also evaluated.
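Purely as an illustration of the kind of pipeline described above, the sketch below shows how MFCC extraction and per-emotion GMM classification might be set up in Python. The choice of librosa and scikit-learn, the frame sizes, the number of coefficients, and the mixture counts are assumptions for this example, not values prescribed by the chapter.

```python
# Minimal sketch, assuming librosa and scikit-learn: MFCC extraction followed
# by one Gaussian mixture model per emotion class. All parameter values
# (13 coefficients, 20 ms frames, 32 mixtures) are illustrative assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Return a (frames x n_mfcc) matrix of MFCCs for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.020 * sr),       # 20 ms frame
                                hop_length=int(0.010 * sr))  # 10 ms shift
    return mfcc.T

def train_emotion_gmms(train_files, n_components=32):
    """train_files: dict mapping emotion label -> list of wav paths."""
    models = {}
    for emotion, paths in train_files.items():
        feats = np.vstack([extract_mfcc(p) for p in paths])
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', max_iter=200)
        models[emotion] = gmm.fit(feats)
    return models

def classify(wav_path, models):
    """Pick the emotion whose GMM gives the highest average log-likelihood."""
    feats = extract_mfcc(wav_path)
    scores = {emotion: gmm.score(feats) for emotion, gmm in models.items()}
    return max(scores, key=scores.get)
```

The same frame-level features could instead be fed to an AANN and the reconstruction error used as the classification score; the GMM variant is shown here only because it is the more compact of the two to sketch.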
