Convoluted Feelings: Convolutional and recurrent nets for detecting emotion from audio data

We present a Convolutional Neural Network (CNN) model for extracting features from audio data and apply it to the task of classifying emotion in speech. Classification using features extracted by our single-layer CNN (1024 filters) outperforms classification using commonly extracted acoustic features. We achieve 50% accuracy on 7-class classification using CNN-extracted features from 500 ms audio patches. In addition, by using Long Short-Term Memory (LSTM) networks to capture context over time, we predict emotion with 70% accuracy.
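
As a concrete illustration of the pipeline described above, here is a minimal PyTorch sketch of the two-stage architecture: a single-layer 1-D CNN with 1024 filters extracts a feature vector from each 500 ms audio patch, and an LSTM classifies the resulting feature sequence into one of 7 emotion classes. Only the 1024-filter count, the 7 classes, and the 500 ms patch length come from the abstract; the raw-waveform input, kernel width, stride, LSTM hidden size, and 16 kHz sampling rate are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn


class PatchCNN(nn.Module):
    """Single-layer 1-D CNN feature extractor for short audio patches.

    The 1024-filter count comes from the abstract; the kernel width,
    stride, and raw-waveform input are assumptions for this sketch.
    """

    def __init__(self, n_filters=1024, kernel_size=80, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel_size, stride=stride)

    def forward(self, x):
        # x: (batch, 1, samples) -- one 500 ms patch per item
        h = torch.relu(self.conv(x))
        # Max-pool over time: one fixed-length 1024-d vector per patch.
        return h.max(dim=2).values


class EmotionLSTM(nn.Module):
    """LSTM that classifies a sequence of patch features into 7 emotions."""

    def __init__(self, feat_dim=1024, hidden=128, n_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, feats):
        # feats: (batch, n_patches, 1024) -- CNN features over an utterance
        _, (h_n, _) = self.lstm(feats)
        return self.out(h_n[-1])  # logits over the 7 emotion classes


if __name__ == "__main__":
    cnn, clf = PatchCNN(), EmotionLSTM()
    # A batch of 2 utterances, each split into 10 patches of 8000 samples
    # (500 ms at an assumed 16 kHz sampling rate).
    patches = torch.randn(2, 10, 1, 8000)
    feats = torch.stack([cnn(u) for u in patches])  # (2, 10, 1024)
    logits = clf(feats)                             # (2, 7)
    print(logits.shape)
```

Max-pooling over time in the CNN yields a fixed-length descriptor per patch, which is one common way to feed variable-length convolution outputs into a recurrent classifier; the paper may well use a different pooling or feature-aggregation scheme.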
