Paper Information - Bidirectional Retrieval Made Simple

Bidirectional Retrieval Made Simple

This paper provides a very simple yet effective character-level architecture for learning bidirectional retrieval models. Aligning multimodal content is particularly challenging considering the difficulty in finding semantic correspondence between images and descriptions. We introduce an efficient character-level inception module, designed to learn textual semantic embeddings by convolving raw characters at distinct granularity levels. Our approach is capable of explicitly encoding hierarchical information from distinct base-level representations (e.g., characters, words, and sentences) into a shared multimodal space, where it maps the semantic correspondence between images and descriptions via a contrastive pairwise loss function that minimizes order violations. Models generated by our approach are far more robust to input noise than state-of-the-art strategies based on word embeddings. Despite being conceptually much simpler and requiring fewer parameters, our models outperform the state-of-the-art approaches by 4.8% in the task of description retrieval and 2.7% in the task of image retrieval (absolute R@1 values) on the popular MS COCO retrieval dataset. We also show that our models present solid performance for text classification, especially in multilingual and noisy domains.
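The abstract names a contrastive pairwise loss that minimizes order violations but does not state its formula. The sketch below is one plausible reading, assuming the order-embedding style penalty (the squared norm of the positive part of a coordinate-wise difference) combined with a margin-based hinge over mismatched pairs in a batch. The function names, the direction of the penalty (caption treated as the more general item), and the margin value of 0.05 are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence

Vector = Sequence[float]


def order_violation(caption: Vector, image: Vector) -> float:
    """Penalty for violating the assumed order 'caption <= image' per coordinate.

    Assumption: the caption is the more general item, so the penalty is the
    squared norm of max(0, caption - image), and it is zero when every
    caption coordinate is at most the matching image coordinate.
    """
    return sum(max(0.0, c - i) ** 2 for c, i in zip(caption, image))


def contrastive_order_loss(
    captions: Sequence[Vector],
    images: Sequence[Vector],
    margin: float = 0.05,  # hypothetical value, chosen only for illustration
) -> float:
    """Hinge loss over a batch where captions[k] matches images[k].

    Each matching pair should have a penalty that is lower, by at least
    `margin`, than the penalty of every mismatched pair built from it.
    """
    total = 0.0
    n = len(captions)
    for k in range(n):
        positive = order_violation(captions[k], images[k])
        for j in range(n):
            if j == k:
                continue
            # Mismatched caption j paired with image k (description retrieval).
            total += max(0.0, margin + positive - order_violation(captions[j], images[k]))
            # Caption k paired with mismatched image j (image retrieval).
            total += max(0.0, margin + positive - order_violation(captions[k], images[j]))
    return total


# Toy batch: each caption sits coordinate-wise below its own image only.
captions = [[0.0, 1.0], [1.0, 0.0]]
images = [[0.5, 1.5], [1.5, 0.5]]
aligned_loss = contrastive_order_loss(captions, images)
shuffled_loss = contrastive_order_loss(list(reversed(captions)), images)
```

In this toy batch the correctly aligned pairs incur no loss, while swapping the captions yields a positive loss, which is the behavior a retrieval model trained with such an objective would be pushed away from.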

Rodrigo C. Barros | Jonatas Wehrmann
