Group-based Distinctive Image Captioning with Memory Attention

Describing images using natural language is widely known as image captioning, which has made steady progress thanks to advances in computer vision and natural language generation techniques. Although conventional captioning models achieve high accuracy on popular metrics, e.g., BLEU, CIDEr, and SPICE, the ability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneering works employ contrastive learning or re-weight the ground-truth captions; however, these approaches focus on a single input image. As a result, the relationships between objects across a group of similar images (e.g., items or properties within the same album, or fine-grained events) are neglected. In this paper, we improve the distinctiveness of image captions using a Group-based Distinctive Captioning Model (GdisCap), which compares each image with the other images in a similar group and highlights the uniqueness of each image. In particular, we propose a group-based memory attention (GMA) module, which stores object features that are unique among the image group (i.e., with low similarity to objects in other images). These unique object features are highlighted when generating captions, resulting in more distinctive captions. Furthermore, the distinctive words in the ground-truth captions are selected to supervise the language decoder and the GMA module. Finally, we propose a new evaluation metric, distinctive word rate (DisWordRate), to measure the distinctiveness of captions. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance in terms of both accuracy and distinctiveness. Results of a user study corroborate the quantitative evaluation and support the validity of the new DisWordRate metric.
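To make the group-contrast idea concrete, below is a minimal NumPy sketch of the core intuition behind GMA: each object feature in the target image is weighted by how dissimilar it is to every object in the other images of the group, so unique objects receive large weights. This is an illustrative approximation, not the paper's exact module; the function names (`distinctiveness_weights`, `cosine_sim`) and the 1-minus-max-similarity weighting are assumptions for exposition, whereas the actual GMA learns memory vectors and attention weights end to end.

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a (n x d) and b (m x d).
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T

def distinctiveness_weights(target_feats, group_feats):
    """Weight each object in the target image by its dissimilarity to all
    objects in the other images of the similar group.

    target_feats: (n_objects, d) array for the target image.
    group_feats:  list of (m_i, d) arrays, one per other image in the group.
    Returns: (n_objects,) weights in [0, 1]; higher = more unique.
    """
    others = np.vstack(group_feats)           # all objects from similar images
    sim = cosine_sim(target_feats, others)    # (n_objects, total_other_objects)
    max_sim = sim.max(axis=1)                 # closest match elsewhere in group
    return 1.0 - np.clip(max_sim, 0.0, 1.0)   # unique objects get weight near 1

# Toy usage: 3 objects in the target image, two similar images in the group.
rng = np.random.default_rng(0)
target = rng.standard_normal((3, 8))
group = [rng.standard_normal((4, 8)), rng.standard_normal((2, 8))]
w = distinctiveness_weights(target, group)
weighted_feats = target * w[:, None]          # emphasize unique objects for decoding
print(w)
```

Similarly, one plausible reading of the DisWordRate metric is the fraction of a generated caption's words that do not appear in the captions describing the other images of the similar group; the paper's exact definition may filter or normalize words differently, so treat this as a sketch of the idea only.

```python
def dis_word_rate(caption, group_captions):
    """Fraction of caption words not used for the other images in the
    similar group (an assumed simplification of DisWordRate)."""
    words = caption.lower().split()
    group_vocab = {w for cap in group_captions for w in cap.lower().split()}
    distinctive = [w for w in words if w not in group_vocab]
    return len(distinctive) / max(len(words), 1)

# Words like "vintage" and "grass" are unique to the target caption here.
print(dis_word_rate("a red vintage car parked on grass",
                    ["a car on the street", "a blue car near a building"]))
```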
