Multiscale Amplitude Feature and Significance of Enhanced Vocal Tract Information for Emotion Classification

In this paper, a novel multiscale amplitude feature is proposed using multiresolution analysis (MRA), and the significance of the vocal tract is investigated for emotion classification from the speech signal. MRA decomposes the speech signal into a number of sub-band signals. The proposed feature is computed by applying a sinusoidal model to each sub-band signal. Different emotions affect the vocal tract differently, so the vocal tract responds in a distinctive way for each emotion. The vocal tract information is enhanced using pre-emphasis, so the emotion information manifested in the vocal tract can be better exploited, which may improve the performance of emotion classification. Emotion recognition is performed on the German emotional speech database (EMODB), the interactive emotional dyadic motion capture (IEMOCAP) database, a simulated stressed speech database, and the FAU AIBO database, using both the speech signal and speech with enhanced vocal tract information (SEVTI). The performance of the proposed multiscale amplitude feature is compared with three other types of features: 1) the mel-frequency cepstral coefficients (MFCCs); 2) the Teager energy operator (TEO)-based feature (TEO-CB-Auto-Env); and 3) the breathiness feature. The proposed feature outperforms the other features. In terms of recognition rates, features derived from the SEVTI signal perform better than those derived from the speech signal. Combining the features computed from the SEVTI signal yields an average recognition rate of 86.7% on the EMODB database.
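To make the described pipeline concrete, the following is a minimal sketch in Python of one plausible realization: a first-order pre-emphasis filter to enhance vocal tract information, a wavelet-based MRA (via PyWavelets) to obtain sub-band signals, and a simple spectral peak-picking stand-in for the sinusoidal amplitude estimate. All function names, the pre-emphasis coefficient, the wavelet choice, and the frame/peak parameters are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of a multiscale amplitude feature pipeline, assuming:
# pre-emphasis -> wavelet MRA sub-bands -> per-band sinusoidal amplitudes.
# Parameter values are illustrative, not taken from the paper.
import numpy as np
import pywt
from scipy.signal import lfilter, find_peaks

def pre_emphasis(x, alpha=0.97):
    """Enhance vocal tract (spectral envelope) information:
    y[n] = x[n] - alpha * x[n-1]."""
    return lfilter([1.0, -alpha], [1.0], x)

def mra_subbands(x, wavelet="db4", level=4):
    """Decompose x into level+1 sub-band signals via wavelet MRA.

    Each sub-band is reconstructed from a single set of wavelet
    coefficients with all other sets zeroed, so the sub-bands sum
    back (approximately) to the original signal.
    """
    coeffs = pywt.wavedec(x, wavelet, level=level)
    subbands = []
    for i in range(len(coeffs)):
        kept = [c if j == i else np.zeros_like(c)
                for j, c in enumerate(coeffs)]
        subbands.append(pywt.waverec(kept, wavelet)[: len(x)])
    return subbands

def sinusoidal_amplitudes(band, frame_len=512, hop=256, n_peaks=5):
    """Per-frame amplitude estimate in the spirit of a sinusoidal model:
    pick the strongest spectral peaks in each windowed frame and
    average their magnitudes."""
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(band) - frame_len, hop):
        frame = band[start:start + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))
        peaks, _ = find_peaks(mag)
        if len(peaks) == 0:
            feats.append(0.0)
            continue
        top = np.sort(mag[peaks])[-n_peaks:]  # strongest peaks only
        feats.append(float(np.mean(top)))
    return np.array(feats)

def multiscale_amplitude_feature(x):
    """One feature value per MRA sub-band of the pre-emphasised
    (SEVTI-like) signal: the mean sinusoidal amplitude of that band."""
    sevti = pre_emphasis(x)
    return np.array([sinusoidal_amplitudes(b).mean()
                     for b in mra_subbands(sevti)])
```

Applying the feature extractor to the pre-emphasised signal rather than the raw waveform mirrors the abstract's claim that SEVTI-derived features capture emotion-dependent vocal tract behavior better than features computed from the unprocessed speech.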
