With the proliferation of multimodal interaction across domains, text-based image retrieval has recently attracted considerable interest in the computer vision community. However, most state-of-the-art techniques model the problem in a purely neural way, which makes it difficult to incorporate pragmatic strategies when searching a large-scale catalog, especially when the search requirements are underspecified and the model must resort to an interactive retrieval process over multiple rounds of question answering. Motivated by this, we propose a neural-symbolic approach for one-shot retrieval of images from a large-scale catalog, given a caption description. To facilitate this, we represent the catalog and the caption as scene graphs and model the retrieval task as a learnable graph-matching problem, trained end-to-end with the REINFORCE algorithm. Further, we briefly describe an extension of this pipeline to an iterative retrieval framework based on interactive question answering.
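To make the idea of graph matching trained with REINFORCE concrete, the following is a minimal sketch under strong assumptions; it is not the model described above. Scene graphs are encoded as sets of (subject, relation, object) triples, the matcher is a two-weight linear scorer over hand-crafted overlap features, and the reward is 1 when the sampled catalog image is the target. All names (`match_features`, `policy`, `reinforce_step`) and the toy catalog are hypothetical.

```python
import math
import random

# Assumption: a scene graph is a set of (subject, relation, object) triples,
# with attributes written as (object, "is", attribute).

def match_features(caption, image):
    """Fraction of caption attribute triples and relation triples found in the image."""
    attrs = [t for t in caption if t[1] == "is"]
    rels = [t for t in caption if t[1] != "is"]
    attr_hit = sum(t in image for t in attrs) / max(len(attrs), 1)
    rel_hit = sum(t in image for t in rels) / max(len(rels), 1)
    return [attr_hit, rel_hit]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def policy(weights, caption, catalog):
    """Retrieval policy: softmax over linear matching scores for each catalog image."""
    feats = [match_features(caption, img) for img in catalog]
    scores = [sum(w * f for w, f in zip(weights, fs)) for fs in feats]
    return softmax(scores), feats

def reinforce_step(weights, caption, catalog, target, rng, lr=0.5):
    """One REINFORCE update: sample an image, observe a 0/1 reward, ascend the gradient."""
    probs, feats = policy(weights, caption, catalog)
    action = rng.choices(range(len(catalog)), weights=probs)[0]
    reward = 1.0 if action == target else 0.0
    # For a softmax-linear policy, grad log pi(action) = feats[action] - E_pi[feats].
    expected = [sum(p * fs[k] for p, fs in zip(probs, feats))
                for k in range(len(weights))]
    return [w + lr * reward * (feats[action][k] - expected[k])
            for k, w in enumerate(weights)]

# Toy catalog of three images (hypothetical data).
catalog = [
    {("cube", "is", "red"), ("sphere", "is", "blue"), ("cube", "left_of", "sphere")},
    {("cube", "is", "red"), ("sphere", "is", "blue"), ("sphere", "left_of", "cube")},
    {("cube", "is", "green"), ("cylinder", "is", "blue"), ("cube", "left_of", "cylinder")},
]
caption = {("cube", "is", "red"), ("cube", "left_of", "sphere")}
target = 0  # index of the image the caption describes

rng = random.Random(0)
weights = [0.0, 0.0]  # uniform retrieval policy before training
for _ in range(200):
    weights = reinforce_step(weights, caption, catalog, target, rng)

final_probs, _ = policy(weights, caption, catalog)
print(weights, final_probs)
```

In this toy setting the second image shares the caption's attribute but reverses the spatial relation, so the policy can only separate it from the target by putting weight on relation overlap. A learned graph encoder would replace the fixed features in a realistic system.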