Hierarchy of Alternating Specialists for Scene Recognition

We introduce a method for improving convolutional neural networks (CNNs) for scene classification. We present a hierarchy of specialist networks that disentangles intra-class variation and inter-class similarity in a coarse-to-fine manner. Our key insight is that different subsets within a class are often associated with different types of inter-class similarity. This suggests that existing network-of-experts approaches, which organize classes into coarse categories, are suboptimal. In contrast, we group images by high-level appearance features rather than by class membership and dedicate a specialist model to each group. In addition, we propose an alternating architecture with a global ordered representation and a global orderless representation to account for both the coarse layout of the scene and its transient objects. We demonstrate that this architecture performs better than either type of representation alone and better than fused features. We also introduce a mini-batch soft k-means that allows end-to-end fine-tuning, as well as a novel routing function for assigning images to specialists. Experimental results show that the proposed approach achieves a significant improvement over baselines, including existing tree-structured CNNs with class-based grouping.
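The abstract does not spell out the mini-batch soft k-means, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes soft assignments are a softmax over negative squared Euclidean distances and that centers are updated as a running weighted mean over mini-batches. The function names, the `temperature` parameter, and the `counts` accumulator are all hypothetical.

```python
import numpy as np


def soft_assign(x, centers, temperature=1.0):
    """Soft cluster memberships: softmax over negative squared distances.

    x: (n, d) features, centers: (k, d). Returns (n, k); each row sums to 1.
    The softmax form and `temperature` are assumptions for illustration.
    """
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)


def minibatch_soft_kmeans_step(x, centers, counts, temperature=1.0):
    """One mini-batch update of the centers as a running weighted mean.

    counts: (k,) soft mass accumulated by each center so far (hypothetical).
    Returns the updated (centers, counts) without modifying the inputs.
    """
    r = soft_assign(x, centers, temperature)       # (n, k) responsibilities
    batch_mass = r.sum(axis=0)                     # (k,) soft counts in batch
    new_counts = counts + batch_mass
    batch_mean = (r.T @ x) / np.maximum(batch_mass, 1e-12)[:, None]
    # Per-center step size shrinks as a center accumulates more mass.
    step = batch_mass / np.maximum(new_counts, 1e-12)
    new_centers = centers + step[:, None] * (batch_mean - centers)
    return new_centers, new_counts
```

Because the memberships are differentiable in both the features and the centers, the same soft assignment could in principle be written in an autodiff framework and trained jointly with the network; this NumPy version only shows the clustering update itself.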
