Deep Mixture of Diverse Experts for Large-Scale Visual Recognition

In this paper, a deep mixture of diverse experts algorithm is developed to achieve more efficient learning of a huge (mixture) network for large-scale visual recognition application. First, a two-layer ontology is constructed to assign large numbers of atomic object classes into a set of task groups according to the similarities of their learning complexities, where certain degrees of inter-group task overlapping are allowed to enable sufficient inter-group message passing. Second, one particular base deep CNNs with <inline-formula><tex-math notation="LaTeX">$M+1$</tex-math><alternatives> <mml:math> <mml:mrow> <mml:mi>M</mml:mi> <mml:mo>+</mml:mo> <mml:mn>1</mml:mn> </mml:mrow> </mml:math> <inline-graphic xlink:href="fan-ieq1-2828821.gif"/></alternatives></inline-formula> outputs is learned for each task group to recognize its <inline-formula><tex-math notation="LaTeX">$M$</tex-math><alternatives> <mml:math> <mml:mi>M</mml:mi> </mml:math> <inline-graphic xlink:href="fan-ieq2-2828821.gif"/></alternatives></inline-formula> atomic object classes and identify one special class of “not-in-group”, where the network structure (numbers of layers and units in each layer) of the well-designed deep CNNs (such as AlexNet, VGG, GoogleNet, ResNet) is directly used to configure such base deep CNNs. For enhancing the separability of the atomic object classes in the same task group, two approaches are developed to learn more discriminative base deep CNNs: (a) our deep multi-task learning algorithm that can effectively exploit the inter-class visual similarities; (b) our two-layer network cascade approach that can improve the accuracy rates for the hard object classes at certain degrees while effectively maintaining the high accuracy rates for the easy ones. Finally, all these complementary base deep CNNs with diverse but overlapped outputs are seamlessly combined to generate a mixture network with larger outputs for recognizing tens of thousands of atomic object classes. Our experimental results have demonstrated that our deep mixture of diverse experts algorithm can achieve very competitive results on large-scale visual recognition.

[1]  Geoffrey E. Hinton,et al.  Deep Mixtures of Factor Analysers , 2012, ICML.

[2]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[3]  Jian Yang,et al.  Sparse Deep Stacking Network for Image Classification , 2015, AAAI.

[4]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[5]  Zhuowen Tu,et al.  Probabilistic boosting-tree: learning discriminative models for classification, recognition, and clustering , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[6]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Michael I. Jordan,et al.  Deep Transfer Learning with Joint Adaptation Networks , 2016, ICML.

[8]  Thomas G. Dietterich,et al.  Dictionary-free categorization of very similar objects via stacked evidence trees , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  João Gama,et al.  Cascade Generalization , 2000, Machine Learning.

[10]  Samy Bengio,et al.  Large-Scale Object Classification Using Label Relation Graphs , 2014, ECCV.

[11]  Jianping Fan,et al.  Integrating Concept Ontology and Multitask Learning to Achieve More Effective Classifier Training for Multilevel Image Annotation , 2008, IEEE Transactions on Image Processing.

[12]  Eric P. Xing,et al.  Large-Scale Category Structure Aware Image Categorization , 2011, NIPS.

[13]  Luís A. Alexandre,et al.  Weighted Convolutional Neural Network Ensemble , 2014, CIARP.

[14]  Peter Kontschieder,et al.  Deep Neural Decision Forests , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[16]  Thomas G. Dietterich,et al.  Dictionary-free categorization of very similar objects via stacked evidence trees , 2009, CVPR.

[17]  Robinson Piramuthu,et al.  HD-CNN: Hierarchical Deep Convolutional Neural Network for Image Classification , 2014, ArXiv.

[18]  Cordelia Schmid,et al.  Semantic Hierarchies for Visual Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Alexei A. Efros,et al.  What makes ImageNet good for transfer learning? , 2016, ArXiv.

[20]  Fei-Fei Li,et al.  What Does Classifying More Than 10, 000 Image Categories Tell Us? , 2010, ECCV.

[21]  Barbara Caputo,et al.  Safety in numbers: Learning categories from few examples with multi model knowledge transfer , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[22]  Jonathan Krause,et al.  Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Larry S. Davis,et al.  Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance , 2011, 2011 International Conference on Computer Vision.

[24]  Jinhui Tang,et al.  Weakly-Shared Deep Transfer Networks for Heterogeneous-Domain Knowledge Propagation , 2015, ACM Multimedia.

[25]  Dahua Lin,et al.  PolyNet: A Pursuit of Structural Diversity in Very Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Marc'Aurelio Ranzato,et al.  Learning Factored Representations in a Deep Mixture of Experts , 2013, ICLR.

[27]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[28]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[29]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[30]  Ohad Shamir,et al.  Probabilistic Label Trees for Efficient Large Scale Image Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Conrad Sanderson,et al.  Fine-grained classification via mixture of deep convolutional neural networks , 2016, 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[32]  Zhuowen Tu,et al.  Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree , 2015, AISTATS.

[33]  Pietro Perona,et al.  Learning and using taxonomies for fast visual categorization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Andrew W. Fitzgibbon,et al.  Efficient Object Category Recognition Using Classemes , 2010, ECCV.

[35]  Jianping Fan,et al.  HD-MTL: Hierarchical Deep Multi-Task Learning for Large-Scale Visual Recognition , 2017, IEEE Transactions on Image Processing.

[36]  Jonathon Shlens,et al.  Fast, Accurate Detection of 100,000 Object Classes on a Single Machine , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[38]  Jonathan Krause,et al.  Fine-Grained Crowdsourcing for Fine-Grained Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Jitendra Malik,et al.  Cross Modal Distillation for Supervision Transfer , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Greg Mori,et al.  Learning Structured Inference Neural Networks with Label Relations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Xiaoou Tang,et al.  Facial Landmark Detection by Deep Multi-task Learning , 2014, ECCV.

[42]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[43]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[44]  Donald Geman,et al.  Vantage Feature Frames for Fine-Grained Categorization , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Jian Yang,et al.  Boosted Convolutional Neural Networks , 2016, BMVC.

[46]  Yoshua Bengio,et al.  Deep Learning of Representations for Unsupervised and Transfer Learning , 2011, ICML Unsupervised and Transfer Learning.

[47]  Peter Kontschieder,et al.  Neural Decision Forests for Semantic Image Labelling , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Andrew Zisserman,et al.  Tabula rasa: Model transfer for object category detection , 2011, 2011 International Conference on Computer Vision.

[49]  Ming Yang,et al.  Large-scale image classification: Fast feature extraction and SVM training , 2011, CVPR 2011.

[50]  Bhiksha Raj,et al.  Unsupervised Fusion Weight Learning in Multiple Classifier Systems , 2015, ArXiv.

[51]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[52]  Geoffrey E. Hinton,et al.  Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , 2017, ICLR.

[53]  Hao Su,et al.  Object Bank: An Object-Level Image Representation for High-Level Visual Recognition , 2014, International Journal of Computer Vision.

[54]  Cordelia Schmid,et al.  Good Practice in Large-Scale Learning for Image Classification , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Jianping Fan,et al.  Hierarchical Learning of Tree Classifiers for Large-Scale Plant Species Identification , 2015, IEEE Transactions on Image Processing.

[56]  Sudha Ram,et al.  Constrained cascade generalization of decision trees , 2004, IEEE Transactions on Knowledge and Data Engineering.

[57]  Pietro Perona,et al.  Unsupervised learning of visual taxonomies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[58]  Rong Yan,et al.  Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News , 2007, IEEE Transactions on Multimedia.

[59]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[60]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[61]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[62]  Dong Yu,et al.  Scalable stacking and learning for building deep architectures , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[63]  Nitish Srivastava,et al.  Discriminative Transfer Learning with Tree-based Priors , 2013, NIPS.

[64]  Alexander C. Berg,et al.  Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition , 2011, NIPS.

[65]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[66]  Subhasis Das,et al.  SNN: Stacked Neural Networks , 2016, ArXiv.

[67]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[68]  John E. Hopcroft,et al.  Stacked Generative Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Antoni B. Chan,et al.  Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[70]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[71]  Jason Weston,et al.  Label Embedding Trees for Large Multi-Class Tasks , 2010, NIPS.

[72]  Jürgen Schmidhuber,et al.  Multi-column deep neural networks for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.