A Caption Is Worth A Thousand Images: Investigating Image Captions for Multimodal Named Entity Recognition

Multimodal named entity recognition (MNER) requires bridging the gap between language understanding and visual context. Driven by advances in natural language processing (NLP) and computer vision (CV), many neural techniques have been proposed to incorporate images into the NER task. In this work, we conduct a detailed analysis of current state-of-the-art fusion techniques for MNER and describe scenarios where adding information from the image does not boost performance. We also study the use of image captions as a way to enrich the context for MNER. We provide extensive empirical analysis and an ablation study on three datasets from popular social platforms to identify the situations in which the approach is beneficial.
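
To make the caption-enrichment idea concrete, here is a minimal Python sketch of the general recipe: generate a natural-language caption for the post's image, append it to the post text as extra context, and run a standard text-only NER tagger over the enriched input. The specific checkpoints named below (Salesforce/blip-image-captioning-base, dslim/bert-base-NER) and the [SEP]-concatenation scheme are illustrative assumptions, not the authors' actual captioner, tagger, or fusion setup.

```python
# Sketch: caption-based context enrichment for multimodal NER.
# Assumptions (not from the paper): BLIP as the captioner and a
# generic BERT NER tagger stand in for the models the authors used.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration, pipeline

CAPTIONER_ID = "Salesforce/blip-image-captioning-base"  # hypothetical choice
processor = BlipProcessor.from_pretrained(CAPTIONER_ID)
captioner = BlipForConditionalGeneration.from_pretrained(CAPTIONER_ID)
ner = pipeline("ner", model="dslim/bert-base-NER",
               aggregation_strategy="simple")

def caption_image(path: str) -> str:
    """Generate a short natural-language caption for the image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

def tag_post(text: str, image_path: str):
    """Tag entities in a social media post, using the image caption
    only as additional context appended after the post text."""
    enriched = f"{text} [SEP] {caption_image(image_path)}"
    # Keep only entities whose character span lies in the original post,
    # discarding any entities the tagger finds inside the caption itself.
    return [e for e in ner(enriched) if e["end"] <= len(text)]

print(tag_post("Messi warming up before the match", "photo.jpg"))
```

Because the caption is plain text, this scheme needs no multimodal fusion architecture: any off-the-shelf NER model can consume the enriched input, which is precisely what makes captions an attractive alternative to image-feature fusion.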
