Paper Information - Scene Retrieval Using Text-to-image GAN-based Visual Similarities and Image-to-text Model-based Textual Similarities

Scene retrieval from a video database is a fundamental problem in computer vision. Traditionally, content-based retrieval methods retrieve the desired scenes with high accuracy by using visual features. However, users cannot apply content-based retrieval methods when they cannot prepare query content. To solve this problem, we propose a novel content-based scene retrieval method built on a text-to-image Generative Adversarial Network (GAN) and an image-to-text model. The proposed method retrieves the desired scenes in the visual feature space with high accuracy even though it takes only a sentence as input. Experimental results show the effectiveness of the proposed method.
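As a rough illustration of the idea in the abstract, the sketch below ranks scenes by combining a visual similarity (between a feature of the image generated from the query sentence and each scene's visual feature) with a textual similarity (between the query sentence's feature and a feature of each scene's generated caption). The abstract does not say how the two similarities are combined, so the weighted sum, the `alpha` parameter, the function names, and the assumption that all features are precomputed vectors are hypothetical choices made only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_scenes(gen_image_feat, query_text_feat, scenes, alpha=0.5):
    """Return scene ids sorted from most to least similar to the query.

    gen_image_feat: feature of the image generated from the query sentence
                    (assumed to come from a text-to-image GAN plus an image encoder).
    query_text_feat: feature of the query sentence itself.
    scenes: list of (scene_id, visual_feat, caption_feat) tuples, where
            caption_feat is assumed to come from an image-to-text model.
    alpha: hypothetical weight on the visual similarity (not from the paper).
    """
    scored = []
    for scene_id, visual_feat, caption_feat in scenes:
        visual_sim = cosine(gen_image_feat, visual_feat)
        textual_sim = cosine(query_text_feat, caption_feat)
        score = alpha * visual_sim + (1.0 - alpha) * textual_sim
        scored.append((score, scene_id))
    # Highest combined score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [scene_id for _, scene_id in scored]
```

Setting `alpha=1.0` falls back to purely visual ranking and `alpha=0.0` to purely textual ranking, which makes it easy to compare the contribution of each similarity in isolation.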

Miki Haseyama | Takahiro Ogawa | Ren Togo | Rintaro Yanagi
