论文信息 - Generating image description by modeling spatial context of an image

Generating image description by modeling spatial context of an image

Generating the descriptive sentences of a real image is a challenging task in image understanding. The difficulty mainly lies in recognizing the interaction activities between objects, and predicting the relationship between objects and stuff/scene. In this paper, we propose a framework for improving image description generation by addressing the above problems. Our framework mainly includes two models: a unified spatial context model and an image description generation model. The former, as the centerpiece of our framework, models 3D spatial context to learn the human-object interaction activities and predict the semantic relationship between these activities and stuff/scene. The spatial context model casts the problems as latent structured labeling problems, and can be resolved by a unified mathematical optimization. Then based on the semantic relationship, the image description generation model generates image descriptive sentences through the proposed lexicalized tree-based algorithm. Experiments on a joint dataset show that our framework outperforms state-of-the-art methods in spatial co-occurrence context analysis, the human-object interaction recognition, and the image description generation.

Kan Li | Lin Bai | Kan Li | Lin Bai

[1] Fei-Fei Li,et al. Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Karl Stratos,et al. Midge: Generating Image Descriptions From Computer Vision Detections , 2012, EACL.

[3] Nathan D. Ratliff,et al. Subgradient Methods for Maximum Margin Structured Learning , 2006 .

[4] Adam Kilgarriff,et al. of the European Chapter of the Association for Computational Linguistics , 2006 .

[5] Yejin Choi,et al. TreeTalk: Composition and Compression of Trees for Image Descriptions , 2014, TACL.

[6] Deva Ramanan,et al. Detecting Actions, Poses, and Objects with Relational Phraselets , 2012, ECCV.

[7] Frank Keller,et al. Image Description using Visual Dependency Representations , 2013, EMNLP.

[8] Armand Joulin,et al. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[9] Ankush Gupta,et al. From Image Annotation to Image Description , 2012, ICONIP.

[10] Kan Li,et al. 3D Depth Perception from Single Monocular Images , 2015, MMM.

[11] Marwan Torki,et al. Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations , 2013, IJCAI.

[12] Cyrus Rashtchian,et al. Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.

[13] Larry S. Davis,et al. Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14] Ruslan Salakhutdinov,et al. Multimodal Neural Language Models , 2014, ICML.

[15] Yejin Choi,et al. Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[16] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[17] Cordelia Schmid,et al. Explicit Modeling of Human-Object Interactions in Realistic Videos , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18] Wei Xu,et al. Explain Images with Multimodal Recurrent Neural Networks , 2014, ArXiv.

[19] Fei-FeiLi,et al. Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses , 2012 .

[20] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Nanning Zheng,et al. Modeling 4D Human-Object Interactions for Event and Object Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[22] Fei-Fei Li,et al. Grouplet: A structured image representation for recognizing human and object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[23] Bernt Schiele,et al. Translating Video Content to Natural Language Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.

[24] Ali Farhadi,et al. Recognition using visual phrases , 2011, CVPR 2011.

[25] Guodong Guo,et al. A survey on still image based human action recognition , 2014, Pattern Recognit..

[26] Yejin Choi,et al. Collective Generation of Natural Image Descriptions , 2012, ACL.