Imagination-Augmented Natural Language Understanding

The human brain integrates linguistic and perceptual information simultaneously to understand natural language, and it has the critical ability to form mental imagery. Such abilities allow us to construct new abstract concepts and concrete objects, and are essential for applying practical knowledge to problems in low-resource scenarios. However, most existing methods for Natural Language Understanding (NLU) focus on textual signals alone. They do not simulate human visual imagination, which hinders models from inferring and learning efficiently from limited data. We therefore introduce an Imagination-Augmented Cross-modal Encoder (iACE) that solves natural language understanding tasks from a novel learning perspective: imagination-augmented cross-modal understanding. iACE enables visual imagination with external knowledge transferred from powerful generative and pre-trained vision-and-language models. Extensive experiments on GLUE and SWAG show that iACE achieves consistent improvements over visually-supervised pre-trained models. More importantly, results in both extreme and normal few-shot settings validate the effectiveness of iACE in low-resource natural language understanding.
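The abstract only sketches the approach at a high level, but the implied pipeline is: generate an "imagined" image for the input text with a generative model, encode that image with a pre-trained vision-and-language model, and fuse it with a textual representation before the task classifier. The snippet below is a minimal sketch of that idea in PyTorch with Hugging Face transformers, not the authors' implementation: the `imagine` helper, the BERT/CLIP choice, and the concatenation-plus-linear fusion head are assumptions made here for concreteness.

```python
# Minimal sketch of an imagination-augmented NLU classifier.
# Assumptions (not specified in the abstract): a VQGAN-CLIP-style text-to-image
# generator hidden behind a hypothetical `imagine()` helper, CLIP as the
# pre-trained vision encoder, BERT as the text encoder, and a simple
# concatenation + linear fusion head.
import torch
import torch.nn as nn
from PIL import Image
from transformers import BertModel, BertTokenizer, CLIPModel, CLIPProcessor


def imagine(text: str) -> Image.Image:
    """Hypothetical stand-in for a text-to-image generator (e.g. VQGAN-CLIP).
    Returns a blank image here; swap in a real generative model in practice."""
    return Image.new("RGB", (224, 224), color=(128, 128, 128))


class IACEStyleClassifier(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        # Fuse the BERT [CLS] vector with CLIP's projected image embedding.
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.clip.config.projection_dim)
        self.head = nn.Linear(fused_dim, num_labels)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                      # [CLS] representation
        img_feat = self.clip.get_image_features(pixel_values=pixel_values)
        return self.head(torch.cat([text_feat, img_feat], dim=-1))


# Usage: imagine a visual scene for the sentence, then classify text + imagination.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sentence = "A man is playing the guitar on stage."
image = imagine(sentence)                              # imagined visual signal
text_inputs = tokenizer(sentence, return_tensors="pt")
image_inputs = processor(images=image, return_tensors="pt")
model = IACEStyleClassifier(num_labels=2)
logits = model(text_inputs.input_ids, text_inputs.attention_mask,
               image_inputs.pixel_values)
```

In low-resource settings, the intuition is that the imagined image injects external visual knowledge that the text encoder alone cannot recover from a handful of labeled examples.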
