A multimodal convolutional neuro-fuzzy network for emotion understanding of movie clips

Multimodal emotion understanding enables AI systems to interpret human emotions. With the rapid growth of video data, emotion understanding remains challenging because of the inherent ambiguity of the data and the diversity of video content. Although deep learning has made considerable progress in learning features from big data, deep models are deterministic and are used in a "black-box" manner, with no capability to represent the ambiguity inherent in the data. Since the possibility theory of fuzzy logic focuses on knowledge representation and reasoning under uncertainty, we incorporate concepts from fuzzy logic into a deep learning framework. This paper presents a novel convolutional neuro-fuzzy network, which integrates convolutional neural networks with fuzzy logic to extract high-level emotion features from text, audio, and visual modalities. The feature sets extracted by the fuzzy convolutional layers are compared with those of conventional convolutional layers at the same depth using t-distributed Stochastic Neighbor Embedding. The paper further demonstrates a multimodal emotion understanding framework with an adaptive neural fuzzy inference system (ANFIS) that can generate new rules to classify emotions. For emotion understanding of movie clips, we concatenate the audio, visual, and text features extracted by the proposed convolutional neuro-fuzzy network and use them to train the ANFIS. We also go one step further and explain how the deep model arrives at its conclusions, a step toward interpretable AI. To identify which visual, textual, and audio aspects matter for emotion understanding, we use the direct linear non-Gaussian acyclic model (DirectLiNGAM) to explain relevance in terms of causal relationships between the features of deep hidden layers. The critical features thus extracted are fed to the proposed multimodal framework to achieve higher accuracy.
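To make the core idea of a fuzzy convolutional layer concrete, the following is a minimal 1-D sketch, not the authors' implementation: an ordinary convolution produces crisp responses, which are then fuzzified with Gaussian membership functions so that each response becomes a vector of membership degrees in [0, 1] (one map per fuzzy set). The function names, the choice of Gaussian memberships, and the set centers/widths are illustrative assumptions.

```python
import numpy as np

def gaussian_mf(x, c, s):
    """Gaussian membership function: degree to which x belongs to
    the fuzzy set with center c and width s (always in (0, 1])."""
    return np.exp(-((x - c) ** 2) / (2 * s ** 2))

def fuzzy_conv1d(signal, kernel, centers, sigmas):
    """Hypothetical fuzzy convolution layer (1-D sketch):
    a crisp convolution followed by fuzzification of each response.
    Returns one membership map per fuzzy set."""
    # np.correlate slides the kernel without reversing it (unlike np.convolve)
    z = np.correlate(signal, kernel, mode="valid")
    # Fuzzify: each crisp response -> membership degree in every fuzzy set
    return np.stack([gaussian_mf(z, c, s) for c, s in zip(centers, sigmas)])

signal = np.array([0.1, 0.5, 0.9, 0.3, 0.7])
kernel = np.array([0.5, 0.5])  # simple averaging filter, illustrative only
maps = fuzzy_conv1d(signal, kernel,
                    centers=[0.0, 0.5, 1.0],   # "low", "medium", "high" response
                    sigmas=[0.3, 0.3, 0.3])
# maps has shape (3 fuzzy sets, 4 valid positions); every entry lies in [0, 1]
```

In this reading, the layer's output is no longer a single crisp feature map but a graded description of each response (e.g. "low", "medium", "high" activation), which is what allows the downstream ANFIS to reason over rule memberships rather than raw activations.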

[1] Dong Yu, et al. Deep Learning: Methods and Applications, 2014, Found. Trends Signal Process.

[2] Rob Fergus, et al. Visualizing and Understanding Convolutional Networks, 2013, ECCV.

[3] Zhang Jian-chao, et al. A new SVM based emotional classification of image, 2005.

[4] Aapo Hyvärinen, et al. DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model, 2011, J. Mach. Learn. Res.

[5] Sukhamay Kundu. The min-max composition rule and its superiority over the usual max-min composition rule, 1998, Fuzzy Sets Syst.

[6] Qingming Huang, et al. Affective Image Content Analysis: A Comprehensive Survey, 2018, IJCAI.

[7] J. Itten. The art of color: the subjective experience and objective rationale of color, 1973.

[8] Xiaoou Tang, et al. Image Aesthetic Assessment: An experimental survey, 2016, IEEE Signal Processing Magazine.

[9] Jiebo Luo, et al. Joint Visual-Textual Sentiment Analysis with Deep Neural Networks, 2015, ACM Multimedia.

[10] P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, 1987.

[11] Amaia Salvador, et al. Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction, 2015, ASM@ACM Multimedia.

[12] Petros Maragos, et al. COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization, 2017, EURASIP J. Image Video Process.

[13] Justin Salamon, et al. Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification, 2016, IEEE Signal Processing Letters.

[14] Yilong Yin, et al. Distribution-Oriented Aesthetics Assessment With Semantic-Aware Hybrid Network, 2019, IEEE Transactions on Multimedia.

[15] Jiebo Luo, et al. Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks, 2015, AAAI.

[16] Sander Dieleman, et al. Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video, 2015, International Journal of Computer Vision.

[17] Verónica Pérez-Rosas, et al. Multimodal Sentiment Analysis of Spanish Online Videos, 2013, IEEE Intelligent Systems.

[18] Yu Ying-lin, et al. Image Retrieval by Emotional Semantics: A Study of Emotional Space and Feature Extraction, 2006, IEEE International Conference on Systems, Man and Cybernetics.

[19] Paulo E. Rauber, et al. Visualizing the Hidden Activity of Artificial Neural Networks, 2017, IEEE Transactions on Visualization and Computer Graphics.

[20] Ajay Kumar, et al. Defect detection in textured materials using Gabor filters, 2000, Conference Record of the 2000 IEEE Industry Applications Conference, Thirty-Fifth IAS Annual Meeting and World Conference on Industrial Applications of Electrical Energy.

[21] Minho Lee, et al. A fuzzy convolutional neural network for text sentiment analysis, 2018, J. Intell. Fuzzy Syst.

[22] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Rosalind W. Picard. Affective computing: challenges, 2003, Int. J. Hum. Comput. Stud.

[24] Jonathon S. Hare, et al. Analyzing and predicting sentiment of images on the social web, 2010, ACM Multimedia.

[25] Aapo Hyvärinen, et al. A Linear Non-Gaussian Acyclic Model for Causal Discovery, 2006, J. Mach. Learn. Res.

[26] Björn W. Schuller, et al. On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues, 2009, Journal on Multimodal User Interfaces.

[27] Björn W. Schuller, et al. Knowledge-Based Approaches to Concept-Level Sentiment Analysis, 2013, IEEE Intell. Syst.

[28] Youyong Kong, et al. A Hierarchical Fused Fuzzy Deep Neural Network for Data Classification, 2017, IEEE Transactions on Fuzzy Systems.

[29] Erik Cambria, et al. Fusing audio, visual and textual clues for sentiment analysis from multimodal content, 2016, Neurocomputing.