Gated Object-Attribute Matching Network for Detailed Image Caption

Image captioning enables computers to generate textual descriptions of images automatically. However, current captions are often not detailed enough: models can name the objects in an image but struggle to describe those objects further. In this study, we present a novel image captioning approach that provides richer object details. Specifically, a visual attention-based LSTM locates the objects, while a semantic attention-based LSTM produces their semantic attributes. Finally, a gated object-attribute matching network matches each object to its semantic attributes. Experiments on the public Flickr30k and MSCOCO datasets demonstrate that the proposed approach improves caption quality compared with state-of-the-art methods.
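The abstract does not give the equations of the matching network, but the described interaction between object features and attribute features can be sketched as a gated attention fusion. Below is a minimal, illustrative reconstruction assuming PyTorch; all dimensions, layer names (`GatedObjectAttributeMatcher`, `obj_proj`, `attr_proj`), and the specific fusion order are hypothetical assumptions, not the authors' implementation.

```python
# Hedged sketch of a gated object-attribute matching step (NOT the paper's
# exact formulation): soft attention selects attributes relevant to an
# object, and a sigmoid gate controls how they are mixed together.
import torch
import torch.nn as nn


class GatedObjectAttributeMatcher(nn.Module):
    """Fuses an object feature with attended attribute features via a gate."""

    def __init__(self, obj_dim: int, attr_dim: int, hidden_dim: int):
        super().__init__()
        # Attention score between one object and each candidate attribute.
        self.score = nn.Linear(obj_dim + attr_dim, 1)
        # Gate deciding how much attribute evidence to mix into the object.
        self.gate = nn.Linear(obj_dim + attr_dim, hidden_dim)
        self.obj_proj = nn.Linear(obj_dim, hidden_dim)
        self.attr_proj = nn.Linear(attr_dim, hidden_dim)

    def forward(self, obj: torch.Tensor, attrs: torch.Tensor) -> torch.Tensor:
        # obj: (batch, obj_dim); attrs: (batch, n_attrs, attr_dim)
        n_attrs = attrs.size(1)
        obj_exp = obj.unsqueeze(1).expand(-1, n_attrs, -1)
        # Soft attention over the candidate attributes for this object.
        weights = torch.softmax(
            self.score(torch.cat([obj_exp, attrs], dim=-1)), dim=1
        )
        attended = (weights * attrs).sum(dim=1)  # (batch, attr_dim)
        # Sigmoid gate controls the object/attribute mixture.
        g = torch.sigmoid(self.gate(torch.cat([obj, attended], dim=-1)))
        return g * self.obj_proj(obj) + (1 - g) * self.attr_proj(attended)


# Usage: match 2 object features against 5 candidate attribute vectors.
matcher = GatedObjectAttributeMatcher(obj_dim=512, attr_dim=300, hidden_dim=512)
fused = matcher(torch.randn(2, 512), torch.randn(2, 5, 300))
print(fused.shape)  # torch.Size([2, 512])
```

The gate lets the model emphasize the visual object feature when no attribute fits well and lean on attribute evidence when the match is strong, which is one plausible reading of "matching objects to their semantic attributes."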
