Heterogeneous Convolutional Neural Networks for Visual Recognition

Deep convolutional neural networks (CNNs) have shown impressive performance for image recognition when trained on large-scale datasets such as ImageNet. CNNs extract hierarchical features layer by layer, starting from raw pixel values, and representations from the highest layers can be efficiently adapted to other visual recognition tasks. In this paper, we propose heterogeneous deep convolutional neural networks (HCNNs) to learn features from different CNN models. Features obtained from heterogeneous CNNs have different characteristics, since each network has a different architecture, with a different depth and a different receptive-field design. HCNNs use a combination network (i.e., another multi-layer neural network) to learn higher-level features by combining those obtained from the heterogeneous base networks. Because the combination network is itself trained, it can better integrate the features from the base networks. To better understand the combination mechanism, we backpropagate the optimal output and evaluate how the network selects features from each model. The results show that the combination network can automatically leverage the different descriptive abilities of the original models, achieving comparable performance on many challenging benchmarks.
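The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature sizes, the single hidden layer, the random weights, and all function names (`make_layer`, `combine`, `input_gradient`) are assumptions made for illustration, and the base CNN features are replaced by random stand-in vectors. The sketch concatenates the features of two base networks, passes them through a small multi-layer network, and then backpropagates the predicted class score to the inputs to estimate how much each base network contributes.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Return (weights, biases) of a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def dense(x, layer):
    """Affine map: one output per weight row."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def combine(feats_a, feats_b, hidden, out):
    """Forward pass of the (hypothetical) combination network."""
    x = feats_a + feats_b            # concatenate features of both base networks
    h_pre = dense(x, hidden)         # hidden pre-activations, kept for backprop
    probs = softmax(dense(relu(h_pre), out))
    return x, h_pre, probs

def input_gradient(h_pre, hidden, out, cls):
    """Gradient of the class-`cls` score (pre-softmax) w.r.t. each input feature."""
    w_h, _ = hidden
    w_o, _ = out
    # Backpropagate through the output layer, gated by the ReLU derivative.
    g_h = [w_o[cls][j] if h_pre[j] > 0 else 0.0 for j in range(len(h_pre))]
    return [sum(g_h[j] * w_h[j][i] for j in range(len(g_h)))
            for i in range(len(w_h[0]))]

rng = random.Random(0)
feats_a = [rng.random() for _ in range(6)]   # stand-in for base network A features
feats_b = [rng.random() for _ in range(4)]   # stand-in for base network B features
hidden = make_layer(10, 8, rng)              # assumed sizes: 10 inputs, 8 hidden units
out = make_layer(8, 3, rng)                  # assumed 3 output classes

x, h_pre, probs = combine(feats_a, feats_b, hidden, out)
cls = probs.index(max(probs))
grad = input_gradient(h_pre, hidden, out, cls)

# Total gradient magnitude attributed to each base network's features.
share_a = sum(abs(g) for g in grad[:len(feats_a)])
share_b = sum(abs(g) for g in grad[len(feats_a):])
print("predicted class:", cls)
print("gradient mass  A:", share_a, " B:", share_b)
```

In a trained model, comparing `share_a` and `share_b` across classes would give a rough indication of which base network the combination network relies on. With the untrained random weights used here, the numbers carry no meaning beyond showing the mechanics.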
