Imagination-Augmented Natural Language Understanding

The human brain integrates linguistic and perceptual information simultaneously to understand natural language, and it has the critical ability to form mental imagery. Such abilities allow us to construct new abstract concepts and concrete objects, and are essential for applying practical knowledge to problems in low-resource scenarios. However, most existing methods for Natural Language Understanding (NLU) focus on textual signals alone. They do not simulate human visual imagination, which hinders models from inferring and learning efficiently from limited data. We therefore introduce an Imagination-Augmented Cross-modal Encoder (iACE) that solves natural language understanding tasks from a novel learning perspective: imagination-augmented cross-modal understanding. iACE enables visual imagination with external knowledge transferred from powerful generative and pre-trained vision-and-language models. Extensive experiments on GLUE and SWAG show that iACE achieves consistent improvements over visually-supervised pre-trained models. More importantly, results in both extreme and normal few-shot settings validate the effectiveness of iACE in low-resource natural language understanding.
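The abstract only sketches the approach at a high level, but the implied pipeline is: generate an "imagined" image for the input text with a generative model, encode that image with a pre-trained vision-and-language model, and fuse it with a textual representation before the task classifier. The snippet below is a minimal sketch of that idea in PyTorch with Hugging Face transformers, not the authors' implementation: the `imagine` helper, the BERT/CLIP choice, and the concatenation-plus-linear fusion head are assumptions made here for concreteness.

```python
# Minimal sketch of an imagination-augmented NLU classifier.
# Assumptions (not specified in the abstract): a VQGAN-CLIP-style text-to-image
# generator hidden behind a hypothetical `imagine()` helper, CLIP as the
# pre-trained vision encoder, BERT as the text encoder, and a simple
# concatenation + linear fusion head.
import torch
import torch.nn as nn
from PIL import Image
from transformers import BertModel, BertTokenizer, CLIPModel, CLIPProcessor


def imagine(text: str) -> Image.Image:
    """Hypothetical stand-in for a text-to-image generator (e.g. VQGAN-CLIP).
    Returns a blank image here; swap in a real generative model in practice."""
    return Image.new("RGB", (224, 224), color=(128, 128, 128))


class IACEStyleClassifier(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        # Fuse the BERT [CLS] vector with CLIP's projected image embedding.
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.clip.config.projection_dim)
        self.head = nn.Linear(fused_dim, num_labels)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                      # [CLS] representation
        img_feat = self.clip.get_image_features(pixel_values=pixel_values)
        return self.head(torch.cat([text_feat, img_feat], dim=-1))


# Usage: imagine a visual scene for the sentence, then classify text + imagination.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sentence = "A man is playing the guitar on stage."
image = imagine(sentence)                              # imagined visual signal
text_inputs = tokenizer(sentence, return_tensors="pt")
image_inputs = processor(images=image, return_tensors="pt")
model = IACEStyleClassifier(num_labels=2)
logits = model(text_inputs.input_ids, text_inputs.attention_mask,
               image_inputs.pixel_values)
```

In low-resource settings, the intuition is that the imagined image injects external visual knowledge that the text encoder alone cannot recover from a handful of labeled examples.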
