Embedding Images and Sentences in a Common Space with a Recurrent Capsule Network

Associating texts and images is an easy and intuitive task for a human being, but it raises some issues if we want that task to be accomplished by a computer. Among these issues, there is the problem of finding a common representation for images and sentences. Based on recent research about capsule networks, we define a novel model to tackle that issue. This model is trained and compared to other recent models on the Flickr8k database on Image Retrieval and Image Annotation (or Sentence Retrieval) tasks. We propose a new recurrent architecture inspired from capsule networks to replace the traditional LSTM/GRU and show how it leads to improved performances. Moreover, we show that the interest of our model goes beyond its performances and includes its intrinsic characteristics, which can explain why it performs particularly well on the Image Annotation task. In addition, we propose a routing procedure between capsules which is fully learned during the training of our model.

[1]  Lior Wolf,et al.  RNN Fisher Vectors for Action Recognition and Image Annotation , 2015, ECCV.

[2]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[4]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[5]  Gang Wang,et al.  Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[7]  David J. Fleet,et al.  VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.

[8]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[9]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[10]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[11]  Bernard Mérialdo,et al.  Natural Language Access to Video Databases , 2017, 2017 IEEE Third International Conference on Multimedia Big Data (BigMM).

[12]  Geoffrey E. Hinton,et al.  Dynamic Routing Between Capsules , 2017, NIPS.

[13]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[14]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[16]  Armand Joulin,et al.  Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[17]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[18]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[20]  Gang Hua,et al.  Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[21]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[23]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[24]  Ruslan Salakhutdinov,et al.  Multimodal Neural Language Models , 2014, ICML.

[25]  Lior Wolf,et al.  Associating neural word embeddings with deep image representations using Fisher Vectors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Geoffrey E. Hinton,et al.  Transforming Auto-Encoders , 2011, ICANN.

[27]  Aviv Eisenschtat,et al.  Linking Image and Text with 2-Way Nets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).