Image based Emotional State Prediction from Multiparty Audio Conversation

Recognizing human emotion is a complex task that has been researched for decades. The problem remains relevant because of its need in various domains, particularly human-computer and human-robot interaction. According to researchers, humans infer another person's state of mind by observing various cues, roughly 70% of them non-verbal. Emotion is embedded in speech, pose, gesture, context, facial expressions, and even the past history of a conversation or situation, and each of these sub-problems can be addressed with learning-based techniques. Predicting emotion in multi-party audio conversation adds further complexity, since the model must account for the intent of speech, culture, accent, gender, and many other sources of diversity. Researchers have made various attempts to classify human audio into the required classes using Support Vector Machine models, Long Short-Term Memory (LSTM) networks, and bi-LSTMs on audio input. We propose an image-based emotion classification approach for audio conversation: the spectrogram of an audio signal, plotted as an image, is used as input to a Convolutional Neural Network model that learns the patterns needed for classification. The proposed approach achieves an accuracy of around 86% on the test dataset, a considerable improvement over state-of-the-art models.
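The core preprocessing step of the proposed pipeline is turning an audio signal into a spectrogram image. The abstract does not specify the exact tooling, so the following is a minimal NumPy-only sketch of that step (framed FFT with a Hann window); the frame length, hop size, and the 440 Hz test tone are illustrative assumptions, and the downstream CNN classifier is omitted.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram (frames x frequency bins) via a windowed, framed FFT.

    frame_len and hop are illustrative choices, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequency bins: frame_len // 2 + 1 of them.
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy input: one second of a 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))

# Log-scaling compresses the dynamic range before rendering the image for the CNN.
log_spec = np.log1p(spec)
```

Each frequency bin spans sr / frame_len = 31.25 Hz here, so the 440 Hz tone shows up as a bright horizontal band near bin 14; the resulting 2-D array is what would be saved as an image and fed to the convolutional classifier.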
