Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search

We introduce Picturebook, a large-scale lookup operation that grounds language in ‘snapshots’ of our physical world accessed through image search. For each word in a vocabulary, we retrieve the top-k images from Google image search and feed them through a convolutional network to produce a word embedding. We propose a multimodal gating function to fuse Picturebook embeddings with other word representations, and we introduce Inverse Picturebook, a mechanism that maps a Picturebook embedding back into words. We report results across a wide range of tasks: word similarity, natural language inference, semantic relatedness, sentiment/topic classification, image-sentence ranking, and machine translation. We also show that the gate activations corresponding to Picturebook embeddings are highly correlated with human judgments of concreteness.

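As a concrete illustration of the two mechanisms named above, the sketch below shows (i) one common formulation of a multimodal gate that fuses a text-based word embedding (e.g. GloVe) with a Picturebook embedding, and (ii) an inverse lookup that maps an embedding back into words via nearest vocabulary entries by inner product. This is a minimal NumPy sketch under assumed shapes and a standard sigmoid-gate parameterization; the exact gating architecture and Inverse Picturebook objective in the paper may differ, and all names here (`gate_fuse`, `inverse_picturebook`, the weights `W_g`, `b_g`) are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_fuse(e_text, e_pic, W_g, b_g):
    """Fuse a textual and a Picturebook embedding with a per-dimension gate.

    e_text : (d,)   text-based word embedding (e.g. GloVe)
    e_pic  : (d,)   Picturebook embedding (CNN features of top-k images)
    W_g    : (d, 2d) gate weights; b_g : (d,) gate bias
    Returns the fused embedding and the gate values.
    """
    # Gate computed from both modalities; values in (0, 1)^d.
    g = sigmoid(W_g @ np.concatenate([e_text, e_pic]) + b_g)
    # Per-dimension convex combination of the two modalities.
    return g * e_text + (1.0 - g) * e_pic, g

def inverse_picturebook(query, picturebook, topn=5):
    """Map an embedding back into words: rank vocabulary by inner product.

    query       : (d,) embedding to invert
    picturebook : dict mapping word -> (d,) Picturebook embedding
    """
    words = list(picturebook)
    mat = np.stack([picturebook[w] for w in words])  # (V, d)
    scores = mat @ query                             # (V,)
    best = np.argsort(-scores)[:topn]
    return [(words[i], float(scores[i])) for i in best]
```

In a formulation like this, the gate value in each dimension indicates how much the fused representation relies on the visual channel versus the textual one, which is what makes it possible to compare gate activations against human concreteness ratings.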