Generating image description by modeling spatial context of an image

Generating descriptive sentences for a real image is a challenging task in image understanding. The difficulty lies mainly in recognizing interaction activities between objects and in predicting the relationship between objects and stuff/scene. In this paper, we propose a framework that improves image description generation by addressing these two problems. The framework comprises two models: a unified spatial context model and an image description generation model. The former, the centerpiece of our framework, models 3D spatial context to learn human-object interaction activities and to predict the semantic relationship between these activities and stuff/scene. The spatial context model casts both problems as latent structured labeling problems, which can be solved by a unified mathematical optimization. Based on the predicted semantic relationship, the image description generation model then generates descriptive sentences through the proposed lexicalized tree-based algorithm. Experiments on a joint dataset show that our framework outperforms state-of-the-art methods in spatial co-occurrence context analysis, human-object interaction recognition, and image description generation.
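For readers unfamiliar with the term, a latent structured labeling problem is commonly written in the following generic max-margin form. This is a sketch of the standard textbook formulation, not the objective used in this paper: the symbols $w$, $\Phi$, $\Delta$, $\lambda$, $x_i$, $y_i$, and $h$ are all assumptions introduced here for illustration, with $x$ an image, $y$ a structured label (for example, an interaction activity and its scene relationship), and $h$ unobserved latent variables.

```latex
% Inference: jointly pick the label y and latent structure h with the best score.
\[
  (\hat{y}, \hat{h}) = \arg\max_{y,\,h} \; w^{\top} \Phi(x, y, h)
\]

% Learning: generic latent structured max-margin objective over N training pairs.
% Delta(y_i, y) is a task loss that penalizes predicting y when y_i is correct.
\[
  \min_{w} \; \frac{\lambda}{2} \lVert w \rVert^{2}
  + \frac{1}{N} \sum_{i=1}^{N}
    \Bigl[
      \max_{y,\,h} \bigl( w^{\top} \Phi(x_i, y, h) + \Delta(y_i, y) \bigr)
      - \max_{h} \; w^{\top} \Phi(x_i, y_i, h)
    \Bigr]
\]
```

In this generic form, several labeling tasks can share one weight vector $w$ and one objective, which is one plausible reading of "unified mathematical optimization"; the abstract itself does not specify the exact objective or solver.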
