Discriminability Objective for Training Descriptive Captions

One property that remains lacking in image captions generated by contemporary methods is discriminability: being able to tell two images apart given the caption for one of them. We propose a way to improve this aspect of caption generation. By incorporating into the captioning training objective a loss component directly related to the ability of a machine to disambiguate image/caption matches, we obtain systems that produce much more discriminative captions, according to human evaluation. Remarkably, our approach also leads to improvements in other aspects of the generated captions, reflected by a battery of standard scores such as BLEU and SPICE. Our approach is modular and can be applied to a variety of model/loss combinations commonly proposed for image captioning.
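The abstract does not specify the exact form of the discriminability term. As an illustration only, one common instantiation of "a machine's ability to disambiguate image/caption matches" is a retrieval-style contrastive loss with hard negatives (in the spirit of VSE++), added to the usual captioning cross-entropy. The sketch below assumes a precomputed similarity matrix between image and caption embeddings; the function names and the weighting scheme are hypothetical, not taken from the paper.

```python
import numpy as np

def discriminability_loss(sim, margin=0.2):
    """Hard-negative contrastive loss over an N x N similarity matrix.

    sim[i, j] = similarity(image_i, caption_j); the diagonal holds the
    matched (ground-truth) pairs. For each image we penalize the hardest
    mismatched caption, and symmetrically for each caption.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                      # scores of matched pairs
    mask = np.eye(n, dtype=bool)
    neg = np.where(mask, -np.inf, sim)      # exclude positives from negatives
    hard_cap = neg.max(axis=1)              # hardest caption per image
    hard_img = neg.max(axis=0)              # hardest image per caption
    return (np.maximum(0.0, margin + hard_cap - pos).mean()
            + np.maximum(0.0, margin + hard_img - pos).mean())

def total_loss(caption_xent, sim, lam=1.0):
    # Combined objective: standard captioning loss plus a weighted
    # discriminability term (lam is a hypothetical trade-off weight).
    return caption_xent + lam * discriminability_loss(sim)
```

When every matched pair already outscores all mismatches by the margin, the discriminability term vanishes and only the captioning loss drives training; otherwise the hardest confusable pairs contribute gradient, pushing the generator toward captions that identify their image.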
