Classification of Emotive Expression Using Verbal and Non-Verbal Components of Speech

This paper presents an approach to classifying emotive expression through speech analysis. The approach combines affective prosody (Mel-Frequency Cepstral Coefficients, Zero Crossing Rate, Chroma Energy Normalised) with semantic analysis (a bag-of-words model). Two machine learning (ML) classifiers, a convolutional neural network (CNN) and a logistic regression (LR) model, are combined into an ensemble-based approach for classifying emotive expressions from multi-modal data (audio and text). The approach builds on existing work in emotion classification, sentiment analysis, and natural language processing (NLP) techniques. The paper contributes a workflow for comparing how effectively emotion can be classified from multi-modal data. The results show mixed accuracy across different data sources, indicating the limitations of, and considerations for, a generalised approach. The approach has direct benefits for affective computing research: it (a) provides insight into the strengths and limitations of such models in correctly classifying emotion across population differences (e.g. gender), and (b) establishes a baseline for emotion classification in speech across the six canonical basic emotions (Anger, Fear, Disgust, Joy, Sadness, Surprise).
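The abstract does not state how the feature pipeline is implemented or how the CNN and LR outputs are fused. The following is a minimal, standard-library-only Python sketch under stated assumptions. It shows a bag-of-words count vector for the text branch, a zero crossing rate for a single frame of audio, and a hypothetical weighted soft-voting rule that averages the per-class probabilities of an audio model and a text model. The names `bag_of_words`, `zero_crossing_rate`, `soft_vote` and `audio_weight` are illustrative and are not the authors' code. MFCC and chroma features are left out because they require a signal-processing library.

```python
from collections import Counter

# The six canonical basic emotions named in the abstract.
EMOTIONS = ["Anger", "Fear", "Disgust", "Joy", "Sadness", "Surprise"]


def bag_of_words(transcript, vocabulary):
    """Count each vocabulary word in a transcript; word order is discarded."""
    counts = Counter(transcript.lower().split())
    return [counts[word] for word in vocabulary]


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ (zero counts as positive)."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


def soft_vote(audio_probs, text_probs, audio_weight=0.5):
    """Hypothetical fusion rule (an assumption, not the paper's method):
    a weighted average of the per-class probabilities from both models."""
    fused = [
        audio_weight * p_audio + (1 - audio_weight) * p_text
        for p_audio, p_text in zip(audio_probs, text_probs)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused


if __name__ == "__main__":
    vocab = ["afraid", "happy", "hate", "sad"]
    print(bag_of_words("I am so happy so happy today", vocab))  # [0, 2, 0, 0]
    print(zero_crossing_rate([0.5, -0.2, -0.1, 0.3, 0.4]))      # 0.5
    # Illustrative probabilities: audio leans towards Sadness, text towards Joy.
    audio = [0.1, 0.1, 0.1, 0.2, 0.4, 0.1]
    text = [0.05, 0.05, 0.05, 0.7, 0.1, 0.05]
    label, _ = soft_vote(audio, text)
    print(label)  # Joy
```

Soft voting is only one possible way to combine two classifiers. Majority voting or stacking a meta-classifier on the two models' outputs are equally plausible readings of "ensemble-based", and the abstract does not say which one was used.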