Multimodal Learning of Social Image Representation by Exploiting Social Relations

Learning the representation for social images has recently made remarkable achievements for many tasks, such as cross-modal retrieval and multilabel classification. However, since social images contain both multimodal contents (e.g., visual images and textual descriptions) and social relations among images, simply modeling the content information may lead to suboptimal embedding. In this paper, we propose a novel multimodal representation learning model for social images, that is, correlational multimodal variational autoencoder (CMVAE) via triplet network. Specifically, in order to mine the highly nonlinear correlation between the visual content and the textual content, a CMVAE is proposed to learn a unified representation for the multiple modalities of social images. Both common information in all modalities and private information in each modality are encoded for the representation learning. To incorporate the social relations among images, we employ the triplet network to embed multiple types of social links in the representation learning. Then, a joint embedding model is proposed to combine the social relations for representation learning of the multimodal contents. Comprehensive experiment results on four datasets confirm the effectiveness of our method in two tasks, namely, multilabel classification and cross-modal retrieval. Our method outperforms the state-of-the-art multimodal representation learning methods with significant improvement of performance.

[1]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[2]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[3]  Thomas Brox,et al.  Generating Images with Perceptual Similarity Metrics based on Deep Networks , 2016, NIPS.

[4]  Jure Leskovec,et al.  Image Labeling on a Network: Using Social-Network Metadata for Image Classification , 2012, ECCV.

[5]  Svetlana Lazebnik,et al.  Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Nir Ailon,et al.  Deep Metric Learning Using Triplet Network , 2014, SIMBAD.

[7]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[8]  Pavlo Molchanov,et al.  Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification , 2016, ACM Multimedia.

[9]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[10]  Yang Song,et al.  Learning Fine-Grained Image Similarity with Deep Ranking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Charu C. Aggarwal,et al.  Heterogeneous Network Embedding via Deep Architectures , 2015, KDD.

[12]  Yuxin Peng,et al.  Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks , 2016, IJCAI.

[13]  Gustavo Carneiro,et al.  Learning Local Image Descriptors with Deep Siamese and Triplet Convolutional Networks by Minimizing Global Loss Functions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Armand Joulin,et al.  Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[15]  Seong-Whan Lee,et al.  Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis , 2014, NeuroImage.

[16]  Erik Cambria,et al.  Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-level Multimodal Sentiment Analysis , 2015, EMNLP.

[17]  Hailin Jin,et al.  Collaborative feature learning from social media , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Feiran Huang,et al.  Learning Social Image Embedding with Deep Multimodal Attention Networks , 2017, ACM Multimedia.

[20]  Kate Saenko,et al.  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[21]  Steven Skiena,et al.  DeepWalk: online learning of social representations , 2014, KDD.

[22]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[23]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[24]  Jeff A. Bilmes,et al.  Deep Canonical Correlation Analysis , 2013, ICML.

[25]  Nitish Srivastava Unsupervised Learning of Visual Representations using Videos , 2015 .

[26]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[27]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[28]  Jun Yu,et al.  Local Deep-Feature Alignment for Unsupervised Dimension Reduction , 2018, IEEE Transactions on Image Processing.

[29]  Qi Tian,et al.  Social Embedding Image Distance Learning , 2014, ACM Multimedia.

[30]  Charu C. Aggarwal,et al.  Towards semantic knowledge propagation from text corpus to web images , 2011, WWW.

[31]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[32]  Phil Blunsom,et al.  Neural Variational Inference for Text Processing , 2015, ICML.

[33]  Zhi-Hua Zhou,et al.  A Unified View of Multi-Label Performance Measures , 2016, ICML.

[34]  Wenwu Zhu,et al.  Learning Compact Hash Codes for Multimodal Representations Using Orthogonal Deep Structure , 2015, IEEE Transactions on Multimedia.

[35]  Jason Weston,et al.  WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.

[36]  Meng Wang,et al.  Multimodal Deep Autoencoder for Human Pose Recovery , 2015, IEEE Transactions on Image Processing.

[37]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Fei Gao,et al.  Deep Multimodal Distance Metric Learning Using Click Constraints for Image Ranking , 2017, IEEE Transactions on Cybernetics.

[39]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[41]  Krystian Mikolajczyk,et al.  Deep correlation for matching images and text , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Tat-Seng Chua,et al.  Learning Image and User Features for Recommendation in Social Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[44]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[45]  Tat-Seng Chua,et al.  Online Collaborative Learning for Open-Vocabulary Visual Classifiers , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[47]  Mark J. Huiskes,et al.  The MIR flickr retrieval evaluation , 2008, MIR '08.

[48]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[50]  Jürgen Schmidhuber,et al.  Multimodal Similarity-Preserving Hashing , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Mingzhe Wang,et al.  LINE: Large-scale Information Network Embedding , 2015, WWW.

[52]  Vincent Lepetit,et al.  Learning descriptors for object recognition and 3D pose estimation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.