Paper information - A Robust Approach to Open Vocabulary Image Retrieval with Deep Convolutional Neural Networks and Transfer Learning


Enabling computer systems to respond to conversational human language is a challenging problem with wide-ranging applications in robotics and human-computer interaction. In image search specifically, humans tend to describe objects in fine-grained detail, such as color or company, and conventional retrieval algorithms perform poorly on such queries. This paper presents a novel approach to open vocabulary image retrieval that selects the correct candidate image from among a set of distractors, given a query in natural language form. Our methodology focuses on generating a robust set of image-text projections capable of accurately representing any image, with the objective of achieving high recall. To this end, an ensemble of classifiers is trained to generate category-based projections: on ImageNet for high-resolution objects, on CIFAR-100 for lower-resolution images of objects, and on Caltech-256 for challenging views of everyday objects. In addition to category-based projections, we also make use of an image captioning model trained on MS COCO and Google Image Search (GISS) to capture additional semantic/latent information about the candidate images. To facilitate image retrieval, the natural language query and the projection results are converted to a common vector representation using word embeddings, with which query-image similarity is computed. When benchmarked on the RefCOCO dataset, the proposed model achieved an accuracy of 68.8% while retrieving semantically meaningful candidate images.
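The final retrieval step described in the abstract (mapping the query and each image's text projections into a shared word-embedding space, then scoring query-image similarity) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three-dimensional vectors in `TOY_EMBEDDINGS` are invented, the helper names `embed`, `cosine`, and `rank_candidates` are hypothetical, and averaging word vectors followed by cosine similarity is one common choice, not necessarily the one used in the paper.

```python
import math

# Toy 3-dimensional word vectors. The values are invented for illustration;
# a real system would load pretrained embeddings instead.
TOY_EMBEDDINGS = {
    "red":   [0.9, 0.1, 0.0],
    "car":   [0.1, 0.9, 0.1],
    "truck": [0.2, 0.8, 0.2],
    "dog":   [0.0, 0.1, 0.9],
    "brown": [0.7, 0.0, 0.3],
}


def embed(words, table=TOY_EMBEDDINGS):
    """Average the vectors of the known words; unknown words are skipped."""
    dim = len(next(iter(table.values())))
    vecs = [table[w] for w in words if w in table]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return candidate names sorted from most to least similar to the query.

    `candidates` maps an image name to the list of words produced for it
    (for example, predicted category labels or caption words).
    """
    query_vec = embed(query.lower().split())
    scored = [
        (cosine(query_vec, embed(words)), name)
        for name, words in candidates.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored]


if __name__ == "__main__":
    candidates = {
        "img_a": ["red", "car"],
        "img_b": ["brown", "dog"],
        "img_c": ["truck"],
    }
    print(rank_candidates("red car", candidates))
```

In this sketch each candidate image is represented only by the words its classifiers or captioner would emit, so retrieval quality depends entirely on how well those projections and the embedding table cover the query vocabulary.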
