Target-Oriented Deformation of Visual-Semantic Embedding Space

Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and modality-specific information, it is insufficient to embed related entities close to each other. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby adjusting similarities between entities. Unlike methods based on cross-modal attention, TOD-Net is a post-process applied to the embedding space learned by existing embedding systems and improves their performances of retrieval. In particular, when combined with cutting-edge models, TOD-Net gains the state-of-the-art cross-modal retrieval model associated with the MSCOCO dataset. Qualitative analysis reveals that TOD-Net successfully emphasizes entity-specific concepts and retrieves diverse targets via handling higher levels of diversity than existing models.

[1]  Zhedong Zheng,et al.  Dual-path Convolutional Image-Text Embeddings with Instance Loss , 2017, ACM Trans. Multim. Comput. Commun. Appl..

[2]  Douwe Kiela,et al.  Poincaré Embeddings for Learning Hierarchical Representations , 2017, NIPS.

[3]  Xilin Chen,et al.  Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[4]  Thomas Hofmann,et al.  Hyperbolic Entailment Cones for Learning Hierarchical Embeddings , 2018, ICML.

[5]  Yongdong Zhang,et al.  Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching , 2019, ACM Multimedia.

[6]  Matthieu Cord,et al.  WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Wei Wang,et al.  Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  P. Pérez,et al.  SoDeep: A Sorting Deep Net to Learn Ranking Loss Surrogates , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Yang Yang,et al.  Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking , 2019, ACM Multimedia.

[11]  Xuanjing Huang,et al.  Reinforced Evolutionary Neural Architecture Search , 2018, ArXiv.

[12]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[13]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[14]  Yale Song,et al.  Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Andrew Gordon Wilson,et al.  Hierarchical Density Order Embeddings , 2018, ICLR.

[16]  Lin Ma,et al.  Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  G. Lewicki,et al.  Approximation by Superpositions of a Sigmoidal Function , 2003 .

[18]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Xiaogang Wang,et al.  CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Jianfeng Gao,et al.  Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators , 2019, ArXiv.

[21]  Liang Wang,et al.  Few-Shot Image and Sentence Matching via Gated Visual-Semantic Embedding , 2019, AAAI.

[22]  Yan Huang,et al.  Learning Semantic Concepts and Order for Image and Sentence Matching , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Yun Fu,et al.  Visual Semantic Reasoning for Image-Text Matching , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Yu Zhang,et al.  Simple Recurrent Units for Highly Parallelizable Recurrence , 2017, EMNLP.

[25]  Gang Wang,et al.  Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[27]  Kate Saenko,et al.  Learning Similarity Conditions Without Explicit Supervision , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[29]  David J. Fleet,et al.  VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.

[30]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[31]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[32]  Samy Bengio,et al.  Density estimation using Real NVP , 2016, ICLR.

[33]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[34]  Xi Chen,et al.  Stacked Cross Attention for Image-Text Matching , 2018, ECCV.

[35]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[36]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[37]  Martin Engilberge,et al.  Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Aviv Eisenschtat,et al.  Linking Image and Text with 2-Way Nets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[40]  Serge J. Belongie,et al.  Conditional Similarity Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Yuxin Peng,et al.  CM-GANs , 2019, ACM Trans. Multim. Comput. Commun. Appl..

[42]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[43]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[44]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).