Interactive Image Search System Based on Multimodal Analogy

We propose an image search system based on multimodal analogy, enabled by a visual-semantic embedding model. It allows us to perform analogical reasoning over images by specifying, with words, properties to be added or removed, e.g., [an image of a blue car] - ‘blue’ + ‘red’. The system consists of two main parts: (i) an encoder that learns joint image-text embeddings and (ii) a similarity measure between embeddings in the multimodal vector space. For the encoder, we adopt the CNN-LSTM encoder proposed in [1], which has been reported to capture multimodal linguistic regularities. We also introduce a new similarity measure based on the difference between the additive and subtractive queries, which yields better results than the previous approach on qualitative analogical reasoning tasks.
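
To make the retrieval step concrete, the sketch below illustrates analogy-based ranking in a shared embedding space. It assumes the CNN-LSTM encoder of [1] has already mapped images and words into that space; encode_image, encode_text, and gallery_vecs are hypothetical names standing in for the encoder's outputs, and the plain cosine ranking shown here is the baseline that our difference-based similarity measure replaces.

    import numpy as np

    def _normalize(v, eps=1e-8):
        # Scale a vector to unit length so dot products equal cosine similarity.
        return v / (np.linalg.norm(v) + eps)

    def analogy_query(image_vec, subtract_vec, add_vec):
        # Form the analogy vector, e.g. [image of a blue car] - 'blue' + 'red'.
        return _normalize(image_vec - subtract_vec + add_vec)

    def rank_gallery(query_vec, gallery_vecs, top_k=5):
        # gallery_vecs: (N, d) array of embedded candidate images.
        # query_vec is unit length, so dividing by each row norm gives cosine scores.
        scores = gallery_vecs @ query_vec / (np.linalg.norm(gallery_vecs, axis=1) + 1e-8)
        return np.argsort(-scores)[:top_k]  # indices of the best matches first

    # Example usage with embeddings produced by the (assumed) encoder:
    # q = analogy_query(encode_image(img), encode_text('blue'), encode_text('red'))
    # best = rank_gallery(q, gallery_vecs)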