Hierarchical deep semantic representation for visual categorization

Abstract

Visual features alone often fail to describe the semantics of an image effectively, and semantic modeling based on a single layer may not cope with complicated semantic content. In this paper, we propose Hierarchical Deep Semantic Representation (H-DSR), a hierarchical framework that combines semantic context modeling with visual features. First, the input image is sampled on a spatially fixed grid, and deep features are extracted for the patch at each location. Second, using pre-learned classifiers, a detection response map is constructed for each patch. A semantic representation, which captures latent semantic context, is then extracted from the map and combined with the visual representation to form a joint representation. Third, a hierarchical deep semantic representation is built through recurrent reconstruction over three layers: the concatenated visual and semantic representations serve as the input to each subsequent layer for semantic representation extraction. Finally, we verify the effectiveness of H-DSR for visual categorization on two publicly available datasets, Oxford Flowers 17 and UIUC-Sports, where it improves performance over many baseline methods.
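The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: per-channel mean and standard deviation stand in for the deep features, the "pre-learned classifiers" are replaced by random linear classifiers with a sigmoid response, and each layer after the first takes the patch's visual features concatenated with the previous layer's semantic responses. The names `grid_patches`, `patch_feature`, `random_classifier_bank`, `semantic_layer` and `h_dsr` are hypothetical.

```python
import numpy as np


def grid_patches(image, grid=4):
    """Split an H x W x C image into grid x grid non-overlapping patches."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    return [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(grid)
        for c in range(grid)
    ]


def patch_feature(patch):
    """Stand-in for a deep feature (assumption): per-channel mean and std."""
    flat = patch.reshape(-1, patch.shape[-1]).astype(np.float64)
    return np.concatenate([flat.mean(axis=0), flat.std(axis=0)])


def random_classifier_bank(dim_visual, num_classes, num_layers=3, seed=0):
    """Hypothetical placeholder for pre-learned classifiers: one weight matrix per layer."""
    rng = np.random.default_rng(seed)
    shapes = [(dim_visual, num_classes)]
    shapes += [(dim_visual + num_classes, num_classes)] * (num_layers - 1)
    return [rng.normal(scale=0.1, size=s) for s in shapes]


def semantic_layer(features, weights):
    """Classifier responses per patch, squashed to (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(features @ weights)))


def h_dsr(image, layer_weights, grid=4):
    """Stack semantic layers; each later layer sees visual + previous semantic features."""
    visual = np.stack([patch_feature(p) for p in grid_patches(image, grid)])
    joint = visual
    for weights in layer_weights:
        semantic = semantic_layer(joint, weights)
        joint = np.concatenate([visual, semantic], axis=1)
    # Flatten the per-patch joint representations into one image-level vector.
    return joint.reshape(-1)


if __name__ == "__main__":
    image = np.random.default_rng(1).random((32, 32, 3))
    weights = random_classifier_bank(dim_visual=6, num_classes=5)
    print(h_dsr(image, weights).shape)
```

In practice the patch features would come from a convolutional network and the classifier weights from training on labelled data; the sketch only shows how the visual and semantic parts are concatenated and fed forward across three layers.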
