VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training

It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training. Thus, conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pre-training (VIVO) that performs pre-training in the absence of caption annotations. By breaking the dependency on paired image-caption training data in VLP, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training. We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score.
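
To make the Hungarian matching loss concrete, the sketch below illustrates one plausible way to compute it for masked tag prediction: because the image tags form an unordered set, predictions at masked positions are matched one-to-one with the target tags via optimal bipartite assignment before cross-entropy is applied. This is an illustrative PyTorch/SciPy sketch under our own assumptions (the function name, tensor shapes, and masking setup are hypothetical), not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of a Hungarian matching
# loss for masked tag prediction over an unordered tag set.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_tag_loss(logits, target_tag_ids):
    """
    logits:          (num_masked, vocab_size) scores at masked tag positions
    target_tag_ids:  (num_targets,) ground-truth tag indices, order-irrelevant
    """
    log_probs = F.log_softmax(logits, dim=-1)          # (M, V)
    # Cost of assigning masked position i to target tag j = negative log-likelihood
    cost = -log_probs[:, target_tag_ids]                # (M, T)
    row_idx, col_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    row_idx = torch.as_tensor(row_idx, device=logits.device)
    col_idx = torch.as_tensor(col_idx, device=logits.device)
    # Cross-entropy under the optimal one-to-one assignment
    matched_targets = target_tag_ids[col_idx]
    return F.cross_entropy(logits[row_idx], matched_targets)

if __name__ == "__main__":
    vocab_size = 1000
    logits = torch.randn(3, vocab_size)                 # 3 masked tag positions
    targets = torch.tensor([17, 512, 99])               # unordered target tag ids
    print(hungarian_tag_loss(logits, targets))
```

Because the assignment is recomputed per example, the loss does not penalize the model for predicting correct tags in a different order than they happen to be stored, which is the point of using set-level matching rather than position-wise cross-entropy.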
