Research on Self-Attention Image Description Technology Based on Object Detection

Image description transforms an image into a text description of its content, but the roles in the image, together with their attributes and relationships, affect the one-to-one correspondence between the text description and the image. To address this problem, we design an encoder-decoder structure based on a self-attention object detector to extract distinct feature and role information, and we design a role integrity check module that supplements the description sentences with rich object information, which improves the accuracy with which the roles in the image are described. We conducted experiments on the Flickr8k, Flickr30k, and MS COCO datasets. The results show that the generated descriptions have better semantic consistency with the images and that, compared with traditional deep learning methods, the method also performs well in semantic diversity.