Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning

OCR-based image captioning is the task of automatically describing images by reading and understanding the written text they contain. Compared to conventional image captioning, this task is more challenging, especially when an image contains multiple text tokens and visual objects. The difficulty lies in making full use of the knowledge carried by the textual entities to facilitate sentence generation, and in predicting a text token from the limited information the image provides. These problems have not yet been fully investigated in existing research. In this paper, we present a novel architecture, the Multimodal Attention Captioner with OCR Spatial Relationship (dubbed MMA-SR), which manages information from different modalities with a multimodal attention network and exploits spatial relationships between text tokens for OCR-based image captioning. Specifically, the representations of text tokens and objects are fed into a three-layer LSTM captioner, and the multimodal attention network computes separate attention scores for text tokens and for objects. Based on the attended features and the LSTM states, words are selected either from the common vocabulary or from the image text by incorporating the learned spatial relationships between text tokens. Extensive experiments on the TextCaps dataset verify the effectiveness of the proposed MMA-SR method. Notably, MMA-SR improves the CIDEr-D score from 93.7% to 98.0%.
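
The abstract describes two mechanisms: a multimodal attention network that attends separately to OCR-token and object features, and a word-selection step that chooses between the common vocabulary and the image's text tokens. The following is a minimal PyTorch sketch of both ideas; the module names, the additive (Bahdanau-style) attention score, the dot-product copy scoring, and all dimensions are illustrative assumptions rather than the paper's actual implementation, which also folds learned spatial relationships between text tokens into the OCR features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultimodalAttention(nn.Module):
        # Additive attention over one modality's features, conditioned on the
        # captioner's LSTM state. A network like this could be applied
        # separately to OCR-token features and to object features; the exact
        # score function here is an assumption.
        def __init__(self, feat_dim, hidden_dim, attn_dim):
            super().__init__()
            self.proj_feat = nn.Linear(feat_dim, attn_dim)
            self.proj_state = nn.Linear(hidden_dim, attn_dim)
            self.score = nn.Linear(attn_dim, 1)

        def forward(self, feats, state):
            # feats: (B, N, feat_dim); state: (B, hidden_dim)
            e = self.score(torch.tanh(
                self.proj_feat(feats) + self.proj_state(state).unsqueeze(1)))
            alpha = F.softmax(e.squeeze(-1), dim=-1)        # (B, N) weights
            ctx = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
            return ctx, alpha

    class VocabOrCopyHead(nn.Module):
        # Pointer-style word selection: score the fixed vocabulary and the
        # image's OCR tokens in a single distribution, so the decoder can
        # either generate a common word or copy a scene-text token.
        def __init__(self, hidden_dim, vocab_size, ocr_dim):
            super().__init__()
            self.vocab_logits = nn.Linear(hidden_dim, vocab_size)
            self.ocr_query = nn.Linear(hidden_dim, ocr_dim)

        def forward(self, state, ocr_feats):
            # ocr_feats: (B, M, ocr_dim); in the paper these would already
            # encode appearance, text embedding, and spatial relationships.
            copy = torch.bmm(ocr_feats,
                             self.ocr_query(state).unsqueeze(-1)).squeeze(-1)
            return torch.cat([self.vocab_logits(state), copy], dim=-1)

    # Illustrative shapes only: 36 object regions, 10 OCR tokens.
    attn = MultimodalAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
    head = VocabOrCopyHead(hidden_dim=512, vocab_size=6000, ocr_dim=300)
    obj_ctx, _ = attn(torch.randn(2, 36, 2048), torch.randn(2, 512))
    logits = head(torch.randn(2, 512), torch.randn(2, 10, 300))  # (2, 6010)

A softmax over the concatenated logits yields one distribution over vocabulary words plus the M OCR tokens, which is one common way such generate-or-copy decoders are trained.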
