Global Second-Order Pooling Convolutional Networks

Deep Convolutional Networks (ConvNets) are fundamental to, besides large-scale visual recognition, a lot of vision tasks. As the primary goal of the ConvNets is to characterize complex boundaries of thousands of classes in a high-dimensional space, it is critical to learn higher-order representations for enhancing non-linear modeling capability. Recently, Global Second-order Pooling (GSoP), plugged at the end of networks, has attracted increasing attentions, achieving much better performance than classical, first-order networks in a variety of vision tasks. However, how to effectively introduce higher-order representation in earlier layers for improving non-linear capability of ConvNets is still an open problem. In this paper, we propose a novel network model introducing GSoP across from lower to higher layers for exploiting holistic image information throughout a network. Given an input 3D tensor outputted by some previous convolutional layer, we perform GSoP to obtain a covariance matrix which, after nonlinear transformation, is used for tensor scaling along channel dimension. Similarly, we can perform GSoP along spatial dimension for tensor scaling as well. In this way, we can make full use of the second-order statistics of the holistic image throughout a network. The proposed networks are thoroughly evaluated on large-scale ImageNet-1K, and experiments have shown that they outperform non-trivially the counterparts while achieving state-of-the-art results.

[1]  Joe Yue-Hei Ng,et al.  FASON: First and Second Order Information Fusion Network for Texture Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Guosheng Lin,et al.  MoNet: Deep Motion Exploitation for Video Object Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Qilong Wang,et al.  Is Second-Order Information Helpful for Large-Scale Visual Recognition? , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[7]  Petros Daras,et al.  Non-linear Convolution Filters for CNN-Based Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Yang Gao,et al.  Compact Bilinear Pooling , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Qilong Wang,et al.  Towards Faster Training of Global Covariance Pooling Networks by Iterative Matrix Square Root Normalization , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Xiao Liu,et al.  Kernel Pooling for Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Yan Wang,et al.  SORT: Second-Order Response Transform for Visual Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[12]  In-So Kweon,et al.  CBAM: Convolutional Block Attention Module , 2018, ECCV.

[13]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Lei Zhang,et al.  G2DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Gang Sun,et al.  Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks , 2018, NeurIPS.

[18]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[19]  Jiaying Liu,et al.  Factorized Bilinear Models for Image Recognition , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Hao Wang,et al.  Multi-scale Location-Aware Kernel Representation for Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Philip S. Yu,et al.  Spatiotemporal Pyramid Network for Video Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Ali Farhadi,et al.  XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks , 2016, ECCV.

[23]  Gang Sun,et al.  Squeeze-and-Excitation Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[25]  Lei Zhang,et al.  Higher-Order Integration of Hierarchical Convolutional Activations for Fine-Grained Visual Categorization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Guihua Wen,et al.  Competitive Inner-Imaging Squeeze and Excitation for Residual Network , 2018, ArXiv.

[29]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Quoc V. Le,et al.  DropBlock: A regularization method for convolutional networks , 2018, NeurIPS.

[31]  Thomas A. Funkhouser,et al.  Dilated Residual Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Song Han,et al.  Learning both Weights and Connections for Efficient Neural Network , 2015, NIPS.

[33]  Cristian Sminchisescu,et al.  Semantic Segmentation with Second-Order Pooling , 2012, ECCV.

[34]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[35]  Joan Bruna,et al.  Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation , 2014, NIPS.

[36]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[37]  Subhransu Maji,et al.  Bilinear CNN Models for Fine-Grained Visual Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[38]  Lei Zhang,et al.  Local Log-Euclidean Multivariate Gaussian Descriptor and Its Application to Image Classification. , 2017, IEEE transactions on pattern analysis and machine intelligence.

[39]  Cristian Sminchisescu,et al.  Matrix Backpropagation for Deep Networks with Structured Layers , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[40]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[41]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .