Paper Information - Image Retrieval With Lingual And Visual Paraphrasing Via Generative Models

This paper proposes a new approach that improves the performance of text-based image retrieval (TBIR). TBIR methods aim to retrieve a desired image that matches a query text. In particular, recent TBIR methods accept a full sentence as the query, which lets them take the relationships between words into account. These methods must uniquely identify the desired image among similar images using a single query sentence. However, the same content can be expressed in many different ways, and this diversity of expression makes it difficult to identify the desired image uniquely. We therefore propose a novel TBIR method that paraphrases the query in multiple representation spaces. Specifically, by paraphrasing the query sentence in both the lingual and the visual representation space, the proposed method retrieves the desired image from various perspectives and can thus distinguish it from similar images. Comprehensive experimental results show the effectiveness of the proposed method.
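As a rough illustration of retrieving "from various perspectives", the sketch below scores each candidate image against several views of the same query, each view living in its own representation space, and averages the similarities. The abstract does not say how the paraphrases are generated, how features are extracted, or how the scores are combined, so everything here is an assumption made for illustration: the function names `cosine`, `score_images`, and `rank_images`, the mean-of-cosines fusion, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_images(query_views, image_feats):
    """Mean similarity of each image over all query views.

    query_views: list of (space, vector) pairs, e.g. the original sentence
                 and its paraphrases, each tagged with the space it lives in.
    image_feats: dict mapping image id -> {space: vector}.
    Averaging is an illustrative assumption, not the paper's fusion rule.
    """
    scores = {}
    for image_id, feats in image_feats.items():
        sims = [cosine(vec, feats[space]) for space, vec in query_views]
        scores[image_id] = sum(sims) / len(sims)
    return scores


def rank_images(query_views, image_feats):
    """Image ids sorted from most to least similar."""
    scores = score_images(query_views, image_feats)
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): two images that look identical in the lingual space.
images = {
    "img_a": {"lingual": [1.0, 0.0], "visual": [1.0, 0.2]},
    "img_b": {"lingual": [1.0, 0.0], "visual": [0.2, 1.0]},
}
original_query = ("lingual", [1.0, 0.0])
visual_paraphrase = ("visual", [1.0, 0.0])  # stand-in for a generated image's feature

single_view = score_images([original_query], images)
multi_view = rank_images([original_query, visual_paraphrase], images)
```

In this toy setup the original query alone gives both images the same score, while adding the view from the visual space separates them. That mirrors the motivation stated in the abstract, not the reported results.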

Miki Haseyama | Takahiro Ogawa | Ren Togo | Rintaro Yanagi
