Learning to Learn Words from Visual Scenes

Language acquisition is the process of learning words from the surrounding scene. We introduce a meta-learning framework that learns how to learn word representations from unconstrained scenes. We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition. Experiments on two datasets show that our approach acquires novel words more rapidly and generalizes more robustly to unseen compositions, significantly outperforming established baselines. A key advantage of our approach is its data efficiency: representations can be learned from scratch without language pre-training. Visualizations and analysis suggest that visual information helps our approach learn a rich cross-modal representation from minimal examples. The project webpage is available at this https URL.
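
To make the episode construction concrete, the following is a minimal sketch of how such training episodes might be assembled from the compositional structure of captions. It assumes a corpus of image-caption pairs; all names here (`Example`, `make_episode`, the `[MASK]` token, `k_support`) are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of episode construction for meta-learning word acquisition.
# All identifiers are illustrative assumptions, not the paper's actual API.
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Example:
    image_id: str       # reference to a visual scene
    caption: List[str]  # tokenized description of that scene

def make_episode(corpus: List[Example], target_word: str,
                 k_support: int = 4) -> Tuple[List[Example], Example, str]:
    """Build one training episode for a held-out word.

    The support set shows the target word in context; the query hides it
    (cloze-style), so the learner must infer the word's meaning from the
    support scenes alone.
    """
    with_word = [ex for ex in corpus if target_word in ex.caption]
    assert len(with_word) > k_support, "need enough examples of the word"
    random.shuffle(with_word)
    support = with_word[:k_support]
    query = with_word[k_support]
    # Mask the target word in the query caption; the meta-learner is
    # trained to recover it from the support episode.
    masked = ["[MASK]" if tok == target_word else tok for tok in query.caption]
    return support, Example(query.image_id, masked), target_word
```

Trained across many such episodes, a meta-learner can then, at test time, infer a representation for a genuinely novel word from only a handful of support examples, which is the rapid-acquisition behavior the experiments measure.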
