Integrating Recurrence Dynamics for Speech Emotion Recognition

We investigate the performance of features that capture the nonlinear recurrence dynamics embedded in the speech signal for the task of Speech Emotion Recognition (SER). Reconstructing the phase space of each speech frame and computing its Recurrence Plot (RP) reveal complex structures that can be quantified through Recurrence Quantification Analysis (RQA). These measures are aggregated by applying statistical functionals over segment and utterance periods. We report SER results for the proposed feature set on three databases using different classification methods. When the proposed features are fused with traditional feature sets, unweighted accuracy improves over the baseline by up to 5.7% and 10.7% on Speaker-Dependent (SD) and Speaker-Independent (SI) SER tasks, respectively. Following a segment-based approach, we demonstrate state-of-the-art performance on IEMOCAP using a Bidirectional Recurrent Neural Network.
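The pipeline described above (time-delay embedding of each frame, a thresholded recurrence matrix, and RQA measures computed from it) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are invented, and the embedding dimension, delay, threshold `eps`, and minimum line length `lmin` are placeholder values (in practice such parameters are estimated, e.g., via mutual information and false-nearest-neighbor criteria).

```python
import numpy as np

def embed(signal, dim=3, tau=5):
    """Time-delay embedding: reconstruct a phase-space trajectory
    from a 1-D signal (placeholder dim/tau, not tuned values)."""
    n = len(signal) - (dim - 1) * tau
    return np.stack([signal[i * tau : i * tau + n] for i in range(dim)], axis=1)

def recurrence_matrix(points, eps):
    """Binary recurrence plot: entry (i, j) is 1 when phase-space
    points i and j are within distance eps of each other."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return (d <= eps).astype(int)

def rqa_measures(rp, lmin=2):
    """Two common RQA measures: recurrence rate (RR) and determinism (DET).
    For simplicity the main diagonal is included in the line statistics."""
    n = rp.shape[0]
    rr = rp.sum() / (n * n)
    # Histogram of diagonal-line lengths in the recurrence plot.
    diag_hist = {}
    for k in range(-(n - 1), n):
        run = 0
        for v in np.diagonal(rp, k):
            if v:
                run += 1
            elif run:
                diag_hist[run] = diag_hist.get(run, 0) + 1
                run = 0
        if run:
            diag_hist[run] = diag_hist.get(run, 0) + 1
    total = sum(l * c for l, c in diag_hist.items())
    # DET: fraction of recurrent points lying on diagonal lines of length >= lmin.
    det = sum(l * c for l, c in diag_hist.items() if l >= lmin) / total if total else 0.0
    return rr, det
```

Per-frame measures like `rr` and `det` would then be summarized over segments and utterances with statistical functionals (mean, standard deviation, etc.) before classification.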
