QIK: A System for Large-Scale Image Retrieval on Everyday Scenes With Common Objects

In this paper, we propose a system for large-scale image retrieval on everyday scenes with common objects that leverages advances in deep learning and natural language processing (NLP). Unlike recent state-of-the-art approaches that extract image features from a convolutional neural network (CNN), our system exploits the predictions made by deep neural networks for image understanding tasks, aiming to capture the relationships between objects in an everyday scene rather than just the individual objects. It works as follows. For each image in the database, the system generates the most probable captions and detects the objects in the image using state-of-the-art deep learning models. The captions are parsed into tree structures using NLP techniques, and these trees are stored and indexed in a database system. When a user poses a query image, a caption is generated for it using deep learning and parsed into its corresponding tree structures. An optimized tree-pattern query is then constructed and executed on the database to retrieve a set of candidate images. Finally, the candidate images are ranked by the tree-edit distance between their tree structures and those of the query. A query based only on the objects detected in the query image can also be formulated and executed; in this case, the ranking scheme uses the probabilities of the detected objects. We evaluated our system on the Microsoft COCO dataset of everyday scenes with common objects and observed that it can outperform state-of-the-art techniques in mean average precision for large-scale image retrieval.
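The final ranking step described above, comparing the parse tree of the query caption against the parse trees of candidate captions, can be illustrated with a plain recursive formulation of ordered-tree edit distance with unit costs. This is a minimal sketch only: a production system would use an optimized algorithm (e.g., Zhang–Shasha style decomposition), and the node labels and candidate trees below are illustrative placeholders, not actual parser output.

```python
from functools import lru_cache

# A tree is (label, (child_tree, child_tree, ...)). Labels here stand in
# for caption parse-tree tokens; they are illustrative placeholders.

def ted(t1, t2):
    """Exact edit distance between two ordered labeled trees
    (unit cost for each insert, delete, and relabel operation)."""

    def size(t):
        return 1 + sum(size(c) for c in t[1])

    @lru_cache(maxsize=None)
    def forest_dist(f, g):
        # f and g are forests: tuples of trees.
        if not f and not g:
            return 0
        if not f:                                   # insert every node of g
            return sum(size(t) for t in g)
        if not g:                                   # delete every node of f
            return sum(size(t) for t in f)
        v_label, v_kids = f[-1]
        w_label, w_kids = g[-1]
        return min(
            forest_dist(f[:-1] + v_kids, g) + 1,    # delete rightmost root of f
            forest_dist(f, g[:-1] + w_kids) + 1,    # insert rightmost root of g
            forest_dist(f[:-1], g[:-1])             # match the two subtrees
            + forest_dist(v_kids, w_kids)
            + (0 if v_label == w_label else 1),     # relabel cost
        )

    return forest_dist((t1,), (t2,))


def rank(query_tree, candidates):
    """Order candidate (image_id, tree) pairs by ascending edit distance
    to the query caption's parse tree."""
    return sorted(candidates, key=lambda c: ted(query_tree, c[1]))
```

For example, the parse trees for the captions "dog sits" and "dog runs" differ by a single relabel, so their distance is 1 and an image captioned "dog sits" would rank above one whose caption tree differs in more nodes. The memoized recursion is exponential in the worst case, which is why efficient tree-edit-distance algorithms matter at retrieval scale.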
