From content to links: Social image embedding with deep multimodal model

Abstract With the popularity of social networks, social media data embedding has attracted extensive research interest and enabled many applications, such as image classification and cross-modal retrieval. In this paper, we examine the scenario in which social images contain multimodal content (e.g., visual content and textual tags) and are connected with each other (e.g., two images submitted to the same group). In such a case, both the multimodal content and the link information provide useful clues for representation learning, so learning the embedding from network structure alone or from data content alone yields a sub-optimal social image representation. We propose Deep Multimodal Attention Networks (DMAN) to combine multimodal content and link information for social image embedding. Specifically, to incorporate the multimodal content effectively, a visual-textual attention model is proposed to encode the fine-grained correlation between modalities, i.e., the alignment between image regions and textual words. To incorporate the network structure into embedding learning, a novel Siamese-Triplet neural network is proposed to model the first-order and second-order proximity among images. The two modules are then integrated into a joint deep model for social image embedding. Once the representation has been learned, a wide variety of data mining problems can be solved with task-specific algorithms designed for vector representations. Extensive experiments on multi-label classification and cross-modal search demonstrate the effectiveness of our approach.
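
As a rough illustration of the two ingredients named in the abstract, the sketch below pairs a soft attention step (a word vector weighting image-region vectors) with a margin-based triplet loss that pulls linked images together and pushes unlinked ones apart. The abstract does not give the actual formulations, so everything here is an assumption: the dot-product scoring, the squared Euclidean distance, the hinge form of the loss, and all function names (`attend`, `triplet_loss`, and so on) are hypothetical stand-ins, not the DMAN implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word, regions):
    """Score each region against a word vector (assumed dot-product scoring)
    and return the attention weights plus the weighted sum of regions."""
    weights = softmax([dot(word, r) for r in regions])
    dim = len(regions[0])
    context = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return weights, context


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, linked, unlinked, margin=1.0):
    """Assumed hinge-style triplet loss: zero once the linked image is closer
    to the anchor than the unlinked image by at least `margin`."""
    gap = squared_distance(anchor, linked) - squared_distance(anchor, unlinked)
    return max(0.0, gap + margin)


# Toy data: two region vectors and one word vector aligned with the first region.
regions = [[1.0, 0.0], [0.0, 1.0]]
word = [2.0, 0.0]
weights, context = attend(word, regions)

# Toy embeddings: a well-separated triplet and a violated one.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
violated = triplet_loss([0.0, 0.0], [3.0, 0.0], [0.0, 1.0])
```

In this toy setup the first region receives the larger attention weight because it aligns with the word vector, and the loss is zero only when the linked image already sits closer to the anchor than the unlinked one by the margin.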
