Vowel formant analysis allows straightforward detection of high-arousal emotions

Recently, automatic emotion recognition from speech has attracted growing interest within the human-machine interaction research community. Most emotion recognition methods use context-independent frame-level or turn-level analysis. In this article, we introduce context-dependent vowel-level analysis for emotion classification. The average first-formant (F1) value extracted at the vowel level serves as a one-dimensional acoustic feature, and classification is performed with the Neyman-Pearson criterion. The resulting classifier detects high-arousal emotions with low error rates. Our experiments show that the smallest emotional unit should be the vowel rather than the word, and that vowel-level analysis can be an important component of a robust emotion classifier. These findings may also benefit the development of robust affective speech recognition methods and high-quality emotional speech synthesis systems.
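To make the detection scheme concrete, the sketch below implements a Neyman-Pearson threshold test on a one-dimensional F1 feature. It is a minimal illustration, not the paper's actual model: it assumes Gaussian class-conditional F1 distributions with similar variances, and all numeric values, function names, and the SciPy-based fitting are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Minimal sketch of a Neyman-Pearson detector on a one-dimensional F1 feature.
# All numbers below are illustrative assumptions, not the paper's parameters.

def fit_gaussian(f1_values):
    """Fit a univariate Gaussian to per-vowel mean F1 values (Hz)."""
    f1 = np.asarray(f1_values, dtype=float)
    return f1.mean(), f1.std(ddof=1)

def np_threshold(mu0, sigma0, alpha=0.05):
    """Threshold with false-alarm probability alpha under the neutral model H0.

    For Gaussian classes that differ in mean (with similar variances), the
    likelihood ratio is monotone in F1, so the Neyman-Pearson test reduces
    to a simple one-sided threshold test on F1 itself.
    """
    return stats.norm.ppf(1.0 - alpha, loc=mu0, scale=sigma0)

def detect_high_arousal(f1_value, threshold):
    """Decide H1 (high arousal) when the vowel-level mean F1 exceeds the threshold."""
    return f1_value > threshold

# --- Illustrative usage with synthetic data (all values are assumptions) ---
rng = np.random.default_rng(0)
neutral_f1 = rng.normal(500.0, 60.0, size=200)   # hypothetical neutral F1 sample
aroused_f1 = rng.normal(620.0, 70.0, size=200)   # hypothetical high-arousal F1 sample

mu0, sigma0 = fit_gaussian(neutral_f1)
t = np_threshold(mu0, sigma0, alpha=0.05)
hit_rate = np.mean(aroused_f1 > t)               # empirical detection rate at fixed alpha
print(f"threshold = {t:.1f} Hz, detection rate = {hit_rate:.2f}")
print(detect_high_arousal(650.0, t))             # classify one hypothetical vowel
```

The design point this illustrates: fixing the false-alarm rate under the neutral model and thresholding the likelihood ratio is optimal in the Neyman-Pearson sense for a simple hypothesis pair, and with a monotone likelihood ratio the test collapses to a single threshold on F1.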
