Deep learning-based late fusion of multimodal information for emotion classification of music video

Affective computing is an emerging area of research that aims to enable intelligent systems to recognize, infer, and interpret human emotions. Music videos, widely available both online and offline, are a rich source for human emotion analysis because they integrate the composer's internal feelings through song lyrics, instrumental performance, and visual expression. The metadata that music video consumers use to choose a product generally includes high-level semantics such as emotion, so automatic emotion analysis is desirable. In this research area, however, the lack of labeled datasets is a major problem. We therefore first construct a balanced music video emotion dataset that covers a diversity of regions, languages, cultures, and musical instruments. We evaluate this dataset on four unimodal and four multimodal convolutional neural networks (CNNs) for music and video. First, we separately fine-tune each pre-trained unimodal CNN and test its performance on unseen data. In addition, we train a 1-dimensional CNN-based music emotion classifier on raw waveform input. A comparative analysis of each unimodal classifier over various optimizers is carried out to find the best model to integrate into a multimodal structure. The best unimodal networks are then combined, integrating the corresponding music and video features into multimodal classifiers. Each multimodal structure fuses the full set of music video features through a late feature fusion strategy and makes the final classification with a softmax classifier. All possible multimodal structures are also combined into a single predictive model to obtain an overall prediction. All the proposed multimodal structures use cross-validation at the decision level to mitigate the overfitting caused by data scarcity. Evaluation with various metrics shows a boost in the performance of the multimodal architectures compared with each unimodal emotion classifier. The predictive model obtained by integrating all multimodal structures achieves 88.56% accuracy, an F1-score of 0.88, and an area under the curve (AUC) of 0.987. These results suggest that high-level human emotions are classified well by the proposed CNN-based multimodal networks, even when only a small amount of labeled data is available for training.
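
To make the raw-waveform branch concrete, the following is a minimal PyTorch sketch of a 1-dimensional CNN music emotion classifier of the kind described above, loosely in the spirit of SampleCNN-style models; the layer counts, filter widths, pooling sizes, and the assumption of six emotion classes are illustrative choices, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class WaveformCNN(nn.Module):
        """1-D CNN over raw audio samples; all shapes are illustrative."""
        def __init__(self, num_classes=6):
            super().__init__()
            blocks, in_ch = [], 1
            for out_ch in (64, 64, 128, 128, 256):
                blocks += [
                    nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                    nn.BatchNorm1d(out_ch),
                    nn.ReLU(),
                    nn.MaxPool1d(3),  # progressively shrink the time axis
                ]
                in_ch = out_ch
            self.features = nn.Sequential(*blocks)
            self.pool = nn.AdaptiveAvgPool1d(1)  # collapse remaining time steps
            self.fc = nn.Linear(in_ch, num_classes)

        def forward(self, x):
            # x: (batch, 1, num_samples), e.g. a few seconds of 16 kHz audio
            h = self.pool(self.features(x)).squeeze(-1)
            return self.fc(h)  # logits; apply softmax for class probabilities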
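
Similarly, here is a minimal sketch of the late feature fusion step, assuming two hypothetical pre-trained unimodal backbones that each emit a fixed-length feature vector; the feature dimensions, hidden size, and dropout rate below are assumptions for illustration, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class LateFusionClassifier(nn.Module):
        """Concatenates unimodal features and classifies with a softmax head."""
        def __init__(self, music_net, video_net,
                     music_dim=256, video_dim=512, num_classes=6):
            super().__init__()
            self.music_net = music_net  # e.g. a waveform CNN feature extractor
            self.video_net = video_net  # e.g. a fine-tuned video/frame CNN
            self.head = nn.Sequential(
                nn.Linear(music_dim + video_dim, 256),
                nn.ReLU(),
                nn.Dropout(0.5),              # guards against overfitting on small data
                nn.Linear(256, num_classes),  # softmax applied at inference time
            )

        def forward(self, waveform, frames):
            m = self.music_net(waveform)      # (batch, music_dim)
            v = self.video_net(frames)        # (batch, video_dim)
            fused = torch.cat([m, v], dim=1)  # late feature fusion
            return self.head(fused)

    # Combining all multimodal structures into one predictive model can then be
    # as simple as averaging the softmax outputs of the individual fusion
    # networks; `models`, `wav`, and `vid` are hypothetical names here.
    # probs = torch.stack([m(wav, vid).softmax(dim=1) for m in models]).mean(dim=0)
    # prediction = probs.argmax(dim=1)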
