Understanding image concepts using ISTOP model

This paper addresses the recognition of image concepts by introducing the ISTOP model. The model parses an image from the scene level down to object parts using a context-sensitive grammar. Because there is a gap between the scene and object levels, the grammar introduces a "visual term" level to bridge it. A visual term is a concept one level above the object level that represents a few co-occurring objects. The grammar can be embodied in an And-Or graph representation. The hierarchical structure of the graph decomposes an image from the scene level into the visual term, object, and part levels through terminal and non-terminal nodes, while the horizontal links in the graph impose context and constraints between nodes. To learn the grammar constraints and their weights, we propose an algorithm that works on weakly annotated datasets. The algorithm searches the dataset to find visual terms without supervision and then learns the weights of the constraints using a latent SVM. Experimental results on the Pascal VOC dataset show that our model outperforms the state-of-the-art approaches in recognizing image concepts.

Highlights

- In understanding an image, there is a significant gap between the scene level and the object level.
- The ISTOP model can parse an image from the scene level to the visual term, object, and part levels using a context-sensitive grammar.
- The visual term is a new concept that can bridge the gap between the scene level and the object level.
- The context used in the grammar can improve object detection as well as visual term detection.
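
As a rough illustration of the And-Or graph idea described above, the following Python sketch scores a toy parse. It is not the authors' implementation: the `Node` class, the `best_parse` function, the node names, and all scores and context weights are hypothetical and chosen only for this example. An And node sums its children's scores and adds a pairwise context weight for each pair of its direct children (standing in for the horizontal links), while an Or node selects its best-scoring alternative. The part level and the latent SVM learning step are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    kind: str  # "and", "or", or "leaf" (hypothetical labels for this sketch)
    children: List["Node"] = field(default_factory=list)
    score: float = 0.0  # made-up detector score, read only for leaf nodes


Context = Dict[Tuple[str, str], float]


def best_parse(node: Node, context: Context) -> Tuple[float, List[str]]:
    """Return the best score under `node` and the leaves chosen to reach it."""
    if node.kind == "leaf":
        return node.score, [node.name]

    results = [best_parse(child, context) for child in node.children]

    if node.kind == "or":
        # An Or node keeps exactly one alternative: the highest-scoring child.
        return max(results, key=lambda r: r[0])

    # An And node needs all of its children: sum their scores ...
    total = sum(s for s, _ in results)
    leaves = [leaf for _, chosen in results for leaf in chosen]
    # ... and add a context weight for every pair of direct children.
    names = [child.name for child in node.children]
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            total += context.get((names[i], names[j]), 0.0)
    return total, leaves


# Toy graph: scene -> visual term (one of two alternatives) -> objects.
riding = Node("person riding bicycle", "and", [
    Node("person", "leaf", score=0.75),
    Node("bicycle", "leaf", score=0.5),
])
walking = Node("person walking dog", "and", [
    Node("person", "leaf", score=0.75),
    Node("dog", "leaf", score=0.25),
])
visual_term = Node("visual term", "or", [riding, walking])
scene = Node("street scene", "and", [visual_term, Node("road", "leaf", score=0.5)])

# Made-up co-occurrence weights between sibling objects.
context = {("person", "bicycle"): 0.5, ("person", "dog"): 0.25}

score, leaves = best_parse(scene, context)
print(score, leaves)
```

In this toy setup the context weights are fixed by hand; in the approach the abstract describes, such weights would be learned from weakly annotated data rather than specified manually.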
