Deep Multimodal Learning for Affective Analysis and Retrieval

Social media has become a convenient platform for voicing opinions, through posts ranging from a short text tweet to an uploaded media file, or any combination of the two. Understanding the perceived emotions underlying this user-generated content (UGC) could benefit emerging applications such as advertising and media analytics. Existing research efforts on affective computing are mostly dedicated to a single medium, either text captions or visual content. Few attempts have been made at combined analysis of multiple media, even though emotion can be viewed as an expression of multimodal experience. In this paper, we explore the learning of the highly non-linear relationships that exist among low-level features across different modalities for emotion prediction. Using a deep Boltzmann machine (DBM), we develop a joint density model over the space of multimodal inputs, including the visual, auditory, and textual modalities. The model is trained directly on UGC data without any labeling effort. While the model learns a joint representation over multimodal inputs, training samples with certain modalities absent can also be leveraged. More importantly, the joint representation enables emotion-oriented cross-modal retrieval, for example, retrieving videos with the text query “crazy cat”. The model does not restrict the types of input and output, so in principle emotion prediction and retrieval over any combination of media are feasible. Extensive experiments on web videos and images show that the learnt joint representation can be very compact and complementary to hand-crafted features, leading to performance improvements in both emotion classification and cross-modal retrieval.
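To make the modeling pipeline concrete, the sketch below shows greedy, layer-wise pretraining of two modality-specific RBM pathways that are fused at a shared hidden layer, trained with one-step contrastive divergence (CD-1). This is an illustrative simplification under stated assumptions, not the paper's implementation: a full multimodal DBM would use visible units matched to each feature type (e.g. Gaussian units for real-valued audiovisual features and a replicated-softmax layer for word counts) and would refine all layers jointly with mean-field inference; the layer sizes and the synthetic binary inputs here are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary-binary restricted Boltzmann machine trained with CD-1."""
    def __init__(self, n_vis, n_hid, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible biases
        self.c = np.zeros(n_hid)   # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_update(self, v0):
        # Positive phase: sample hidden units given the data.
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back to a reconstruction.
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # CD-1 gradient approximation: data statistics minus model statistics.
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += self.lr * (v0 - pv1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)

# Layer-wise pretraining of modality-specific pathways.
# Synthetic binarized features stand in for real visual/textual inputs.
X_vis = (rng.random((256, 40)) < 0.3).astype(float)   # e.g. binarized visual features
X_txt = (rng.random((256, 30)) < 0.3).astype(float)   # e.g. binarized word occurrences

rbm_vis = RBM(40, 24)
rbm_txt = RBM(30, 24)
for _ in range(50):
    rbm_vis.cd1_update(X_vis)
    rbm_txt.cd1_update(X_txt)

# Joint layer over the concatenated modality-specific representations.
H = np.hstack([rbm_vis.hidden_probs(X_vis), rbm_txt.hidden_probs(X_txt)])
rbm_joint = RBM(48, 32)
for _ in range(50):
    rbm_joint.cd1_update(H)

joint_repr = rbm_joint.hidden_probs(H)   # shared multimodal representation
print(joint_repr.shape)                  # (256, 32)
```

In this simplified stack, `joint_repr` plays the role of the shared representation: it can be fed to an emotion classifier or used as a common embedding space for cross-modal retrieval, and a sample missing one modality could, in the spirit of the paper, infer the joint layer from its available pathway alone.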
