Paper Information - Target-Oriented Deformation of Visual-Semantic Embedding Space

Target-Oriented Deformation of Visual-Semantic Embedding Space

Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and modality-specific information, simply embedding related entities close to each other is insufficient. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby adjusting similarities between entities. Unlike methods based on cross-modal attention, TOD-Net is a post-process applied to the embedding space learned by existing embedding systems, and it improves their retrieval performance. In particular, when combined with cutting-edge models, TOD-Net yields a state-of-the-art cross-modal retrieval model on the MSCOCO dataset. Qualitative analysis reveals that TOD-Net successfully emphasizes entity-specific concepts and retrieves diverse targets by handling higher levels of diversity than existing models.

Takashi Matsubara
