Paragraph Generation Network with Visual Relationship Detection

Paragraph generation of images is a new concept, aiming to produce multiple sentences to describe a given image. In this paper, we propose a paragraph generation network with introducing visual relationship detection. We first detect regions which may contain important visual objects and then predict their relationships. Paragraphs are produced based on object regions which have valid relationship with others. Compared with previous works which generate sentences based on region features, we explicitly explore and utilize visual relationships in order to improve final captions. The experimental results show that such strategy could improve paragraph generating performance from two aspects: more details about object relations are detected and more accurate sentences are obtained. Furthermore, our model is more robust to region detection fluctuation.

[1]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[3]  Li-Jia Li,et al.  Dense Captioning with Joint Inference and Visual Context , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[5]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[6]  Shih-Fu Chang,et al.  Visual Translation Embedding Network for Visual Relation Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Stephen Gould,et al.  Multi-Class Segmentation with Relative Location Prior , 2008, International Journal of Computer Vision.

[8]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[9]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[11]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Jonathan Krause,et al.  A Hierarchical Approach for Generating Descriptive Image Paragraphs , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Ji Zhang,et al.  Relationship Proposal Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Tat-Seng Chua,et al.  SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Li Fei-Fei,et al.  DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Wei Xu,et al.  Explain Images with Multimodal Recurrent Neural Networks , 2014, ArXiv.

[17]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[19]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[20]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Cordelia Schmid,et al.  Areas of Attention for Image Captioning , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Richard Socher,et al.  Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Larry S. Davis,et al.  Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.