Deep Coordinated Textual and Visual Network for Sentiment-Oriented Cross-Modal Retrieval

Cross-modal retrieval, which enables people to retrieve desired information efficiently from large amounts of multimedia data, has attracted increasing attention in recent years. Most cross-modal retrieval methods focus only on aligning the objects that appear in images and text, yet sentiment alignment is also essential for applications such as entertainment and advertising. This paper studies the problem of retrieving visual sentiment concepts, with the goal of extracting sentiment-oriented information from social multimedia content, i.e., sentiment-oriented cross-modal retrieval. The problem is inherently challenging because adjectives such as “sad” and “awesome” are subjective and ambiguous. We therefore model visual sentiment concepts as adjective-noun pairs, e.g., “sad dog” and “awesome flower”, since associating adjectives with concrete objects makes the concepts more tractable. This paper proposes a deep coordinated textual and visual network with two branches that learns a joint semantic embedding space for images and texts. The visual branch is based on a convolutional neural network (CNN) pre-trained on a large dataset and is optimized with a classification loss. The textual branch is attached to the fully-connected layer and provides supervision from the textual semantic space. To learn coordinated representations for the two modalities, a multi-task loss function is optimized during end-to-end training. Extensive experiments on a subset of the large-scale VSO dataset show that the proposed model retrieves sentiment-oriented data effectively and performs favorably against state-of-the-art methods.
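
To make the described two-branch architecture concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the VGG-16 backbone, layer sizes, the cosine formulation of the textual supervision, and the loss weight alpha are all assumptions, since the abstract only specifies a pre-trained CNN branch with a classification loss, a textual branch attached to the fully-connected layer, and a jointly optimized multi-task loss.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CoordinatedNet(nn.Module):
    """Hypothetical two-branch network: a classification head and a
    textual projection head sharing one pre-trained CNN trunk."""

    def __init__(self, num_concepts, text_dim=300):
        super().__init__()
        # Visual branch: CNN pre-trained on ImageNet (backbone choice is an assumption).
        backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = backbone.features
        self.avgpool = backbone.avgpool
        # Keep the pre-trained fully-connected layers up to fc7 (4096-d output).
        self.fc = nn.Sequential(*list(backbone.classifier.children())[:-1])
        # Classification head over adjective-noun pair (ANP) concepts.
        self.classify = nn.Linear(4096, num_concepts)
        # Textual branch on top of the fully-connected layer: projects the
        # visual feature into the textual semantic space (e.g., the word
        # embedding of the ANP label).
        self.text_proj = nn.Linear(4096, text_dim)

    def forward(self, images):
        h = self.features(images)
        h = torch.flatten(self.avgpool(h), 1)
        h = self.fc(h)
        return self.classify(h), self.text_proj(h)

def multi_task_loss(logits, visual_embed, labels, label_text_embed, alpha=1.0):
    """Classification loss plus an alignment loss pulling the projected
    visual feature toward the text embedding of its ANP label (the weight
    alpha and the cosine distance are assumptions)."""
    cls_loss = nn.functional.cross_entropy(logits, labels)
    align_loss = (1.0 - nn.functional.cosine_similarity(
        visual_embed, label_text_embed, dim=1)).mean()
    return cls_loss + alpha * align_loss
```

Under a setup like this, retrieval reduces to nearest-neighbor search: projected image features and text embeddings (e.g., word2vec vectors of adjective-noun pairs) live in the same joint space and can be ranked by cosine similarity.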
