Compact Bidirectional Transformer for Image Captioning

Most current image captioning models generate captions from left to right. This unidirectional property means they can exploit only past context, not future context. Recent refinement-based models do exploit both past and future context by generating a new caption in a second stage, conditioned on captions retrieved or generated in a first stage; however, the decoder of these models generally consists of two networks (i.e. a retriever or captioner in the first stage and a refiner in the second stage) that can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context both implicitly and explicitly, while its decoder can be executed in parallel. Specifically, the model tightly couples the left-to-right (L2R) and right-to-left (R2L) flows into a single compact model (i.e. implicitly) and optionally allows interaction between the two flows (i.e. explicitly), while the final caption is chosen from either the L2R or the R2L flow in a sentence-level ensemble manner. We conduct extensive ablation studies on the MSCOCO benchmark and find that the compact architecture, which serves as a regularization for implicitly exploiting bidirectional context, and the sentence-level ensemble play more important roles than the explicit interaction mechanism. By seamlessly combining the sentence-level ensemble with a word-level ensemble, its effect is further amplified. We further extend conventional one-flow self-critical training to a two-flow version under this architecture and achieve new state-of-the-art results among non-vision-language-pretraining models. Source code is available at https://github.com/YuanEZhou/CBTrans.

INTRODUCTION

Image captioning (Vinyals et al. 2015; Yang et al. 2019; Pan et al. 2020; Zhou et al. 2021), which aims at describing the visual content of an image with natural language sentences, is one of the fundamental tasks connecting vision and language. Inspired by the sequence-to-sequence model (Sutskever, Vinyals, and Le 2014) for neural machine translation, most proposed models follow the encoder-decoder paradigm, in which a convolutional neural network (CNN) or Transformer encodes the input image and a recurrent neural network (RNN) or Transformer (Vaswani et al. 2017) serves as the sentence decoder to generate a caption.

[Figure: illustration of the two decoding flows, with the left-to-right flow producing "<l2r> a man in ..." and the right-to-left flow producing the same caption in reverse order, "<r2l> beach the on a man in ...".]
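To make the coupling and the sentence-level ensemble concrete, the following minimal PyTorch-style sketch illustrates the idea. It is not the authors' actual implementation: the decoder callable, its (tokens, memory) signature, and the helper names are hypothetical, and the sketch assumes both flows decode prefixes of equal length in lockstep.

import torch

def bidirectional_decode(decoder, memory, l2r_tokens, r2l_tokens):
    # decoder: ONE shared Transformer decoder, i.e. a single compact
    #          set of parameters serving both flows (an assumption of
    #          this sketch, mirroring the "compact model" in the text).
    # memory:  encoded image features, shape (1, n_regions, d_model).
    # l2r_tokens / r2l_tokens: current prefixes of the two flows, shape (t,).
    # Stacking the two flows along the batch dimension lets the shared
    # decoder process them in parallel rather than sequentially.
    tokens = torch.stack([l2r_tokens, r2l_tokens], dim=0)   # (2, t)
    logits = decoder(tokens, memory.expand(2, -1, -1))      # (2, t, vocab)
    return logits[0, -1], logits[1, -1]   # next-word scores per flow

def sentence_level_ensemble(l2r_caption, l2r_logp, r2l_caption, r2l_logp):
    # Sentence-level ensemble: keep whichever completed flow assigns
    # its own caption the higher sequence log-probability; the R2L
    # caption is reversed back into normal reading order first.
    if l2r_logp >= r2l_logp:
        return l2r_caption
    return list(reversed(r2l_caption))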

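The word-level ensemble that the abstract combines with the sentence-level one is, in standard practice, an average of the per-step next-word distributions of several models; whether the paper uses exactly this formulation is an assumption here. A hedged sketch under that assumption:

import torch

def word_level_ensemble_step(models, tokens, memory):
    # Average the next-word probability distributions of all member
    # models (assumed to share one vocabulary and one input signature),
    # then hand the merged log-probabilities to the decoding loop
    # exactly as if they came from a single model.
    probs = [torch.softmax(m(tokens, memory)[:, -1], dim=-1) for m in models]
    return torch.log(torch.stack(probs, dim=0).mean(dim=0))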
REFERENCES

[1] Hongtao Lu, et al. Show, Recall, and Tell: Image Captioning with Recall Mechanism. AAAI 2020.
[2] Gang Wang, et al. An Empirical Study of Language CNN for Image Captioning. ICCV 2017.
[3] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. ECCV 2020.
[4] Hanwang Zhang, et al. More Grounded Image Captioning by Distilling Image-Text Matching Model. CVPR 2020.
[5] Desmond Elliott, et al. Multilingual Image Description with Neural Sequence Models. arXiv:1510.04709, 2015.
[6] Rich Caruana, et al. Multitask Learning. Encyclopedia of Machine Learning and Data Mining, 1998.
[7] Xinlei Chen, et al. In Defense of Grid Features for Visual Question Answering. CVPR 2020.
[8] Fawaz Sammani, et al. Look and Modify: Modification Networks for Image Captioning. BMVC 2019.
[9] Fei-Fei Li, et al. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015.
[10] Jianfei Cai, et al. Auto-Encoding Scene Graphs for Image Captioning. CVPR 2019.
[11] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks. NIPS 2014.
[12] Wei Zhao, et al. A Multi-task Learning Approach for Image Captioning. IJCAI 2018.
[13] Jie Chen, et al. Attention on Attention for Image Captioning. ICCV 2019.
[14] Rita Cucchiara, et al. Meshed-Memory Transformer for Image Captioning. CVPR 2020.
[15] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[16] Tao Mei, et al. X-Linear Attention Networks for Image Captioning. CVPR 2020.
[17] Jiajun Zhang, et al. Synchronous Bidirectional Neural Machine Translation. TACL 2019.
[18] Tao Mei, et al. Boosting Image Captioning with Attributes. ICCV 2017.
[19] Yejin Choi, et al. VinVL: Revisiting Visual Representations in Vision-Language Models. CVPR 2021.
[20] Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv, 2015.
[21] Jian Sun, et al. Deep Residual Learning for Image Recognition. CVPR 2016.
[22] Lei Zhang, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018.
[23] Alexander Schwing, et al. Fast, Diverse and Accurate Image Captioning Guided by Part-of-Speech. CVPR 2019.
[24] Geoffrey E. Hinton, et al. Layer Normalization. arXiv, 2016.
[25] Yongjian Wu, et al. RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words. CVPR 2021.
[26] Tao Mei, et al. Exploring Visual Relationship for Image Captioning. ECCV 2018.
[27] Lukasz Kaiser, et al. Attention Is All You Need. NIPS 2017.
[28] Chunfeng Yuan, et al. Open-book Video Captioning with Retrieve-Copy-Generate Network. CVPR 2021.
[29] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
[30] Enhong Chen, et al. Regularizing Neural Machine Translation by Target-bidirectional Agreement. AAAI 2019.
[31] Hongtao Lu, et al. Look Back and Predict Forward in Image Captioning. CVPR 2019.
[32] Samy Bengio, et al. Show and Tell: A Neural Image Caption Generator. CVPR 2015.
[33] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
[34] Fawaz Sammani, et al. Show, Edit and Tell: A Framework for Editing Image Captions. CVPR 2020.
[35] Wei Liu, et al. Recurrent Fusion Network for Image Captioning. ECCV 2018.
[36] Zhenzhen Hu, et al. Semi-Autoregressive Transformer for Image Captioning. ICCVW 2021.
[37] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. IEEvaluation@ACL 2005.
[38] Jiebo Luo, et al. Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning. ICCV 2019.
[39] Xin Wang, et al. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. ICCV 2019.
[40] Zhe Gan, et al. Distilling Knowledge Learned in BERT for Text Generation. ACL 2020.
[41] Yongjian Wu, et al. Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network. AAAI 2021.
[42] Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. ACL 2004.
[43] Yi Yang, et al. Entangled Transformer for Image Captioning. ICCV 2019.
[44] Basura Fernando, et al. SPICE: Semantic Propositional Image Caption Evaluation. ECCV 2016.
[45] Ruotian Luo. A Better Variant of Self-Critical Sequence Training. arXiv, 2020.
[46] Jianfeng Gao, et al. Unified Vision-Language Pre-Training for Image Captioning and VQA. AAAI 2020.
[47] Salim Roukos, et al. BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002.
[48] Rongrong Ji, et al. Asynchronous Bidirectional Decoding for Neural Machine Translation. AAAI 2018.
[49] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[50] Wei Liu, et al. Controllable Video Captioning with POS Sequence Guidance Based on Gated Fusion Network. ICCV 2019.
[51] Xiaofei Zhou, et al. Image Captioning with Context-Aware Auxiliary Guidance. arXiv, 2020.
[52] C. Lawrence Zitnick, et al. CIDEr: Consensus-based Image Description Evaluation. CVPR 2015.
[53] Christoph Meinel, et al. Image Captioning with Deep Bidirectional LSTMs. ACM Multimedia 2016.
[54] Vaibhava Goel, et al. Self-Critical Sequence Training for Image Captioning. CVPR 2017.
[55] Tao Mei, et al. Hierarchy Parsing for Image Captioning. ICCV 2019.