Robust Acoustic Emotion Recognition Based on Cascaded Normalization and Extreme Learning Machines

A key challenge in speech emotion recognition is achieving robust, speaker-independent performance. In this paper, we take a cascaded normalization approach, combining linear speaker-level, nonlinear value-level, and feature-vector-level normalization to minimize speaker-related effects and to maximize class separability for linear kernel classifiers. We use extreme learning machine classifiers on a four-class problem (joy, anger, sadness, neutral) and show the efficacy of the proposed method on the recently collected Turkish Emotional Speech Database.
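The abstract does not spell out the exact normalizers at each stage, but a plausible instantiation of the three-stage cascade is: per-speaker z-normalization (linear, speaker level), signed power normalization (nonlinear, value level), and instance-wise L2 normalization (feature vector level). The sketch below is illustrative only; the function name, the choice of power exponent `alpha`, and the specific normalizers are assumptions, not the authors' published recipe.

```python
import numpy as np

def cascaded_normalize(X, speaker_ids, alpha=0.5, eps=1e-8):
    """Hypothetical three-stage cascaded normalization.

    X           : (n_samples, n_features) acoustic feature matrix
    speaker_ids : (n_samples,) speaker label per utterance
    """
    X = X.astype(float).copy()

    # 1) Linear speaker-level: z-normalize each feature within each speaker
    #    to reduce speaker-specific offsets and scales.
    for s in np.unique(speaker_ids):
        idx = speaker_ids == s
        mu = X[idx].mean(axis=0)
        sd = X[idx].std(axis=0) + eps
        X[idx] = (X[idx] - mu) / sd

    # 2) Nonlinear value-level: signed power normalization, which
    #    compresses large magnitudes while preserving sign.
    X = np.sign(X) * np.abs(X) ** alpha

    # 3) Feature-vector-level: L2-normalize each instance so that
    #    linear kernels behave like cosine similarities.
    X /= np.linalg.norm(X, axis=1, keepdims=True) + eps
    return X
```

After the final stage every feature vector has (approximately) unit L2 norm, which is what makes the subsequent linear-kernel classification scale-invariant across utterances.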
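The extreme learning machine classifier mentioned above trains a single-hidden-layer network without backpropagation: input weights are drawn at random and only the output weights are fit, in closed form, by regularized least squares. A minimal sketch follows; the hyperparameters (hidden size, tanh activation, regularization constant `C`) are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def elm_train(X, Y, n_hidden=100, C=10.0, seed=0):
    """Train an extreme learning machine.

    X : (n_samples, n_features) inputs
    Y : (n_samples, n_classes) one-hot targets
    """
    rng = np.random.default_rng(seed)
    # Random input weights and biases -- these are never updated.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)  # hidden-layer activations
    # Closed-form ridge-regression solve for the output weights.
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    # Class with the largest output-layer response wins.
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
```

Because the only learned parameters come from one linear solve, training is orders of magnitude faster than gradient-based alternatives, which is a common reason ELMs are paired with heavily normalized paralinguistic features.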
