Learning Adaptive Receptive Fields for Deep Image Parsing Network

In this paper, we introduce a novel approach that automatically regulates the receptive field in a deep image parsing network. Unlike previous works, which emphasize obtaining better receptive fields through manually selected dilated convolutional kernels, our approach uses two affine transformation layers in the network's backbone that operate on feature maps. The new layers inflate or shrink the feature maps, which changes the receptive fields of the following layers accordingly. Trained end-to-end, the whole framework is data-driven and requires no laborious manual intervention. The proposed method is generic across datasets and tasks. We conduct extensive experiments on both general parsing and face parsing as concrete examples to demonstrate the method's superior regulation ability over manual designs.
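The link between resizing a feature map and the receptive field of later layers can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `affine_resize` and `effective_receptive_field` are hypothetical, the scale is a fixed number rather than a learned parameter, the transform is restricted to isotropic scaling, and bilinear sampling is an assumed choice.

```python
def affine_resize(feature, scale):
    """Resample a 2D feature map (list of rows) by an isotropic scale.

    Assumption: bilinear interpolation with a fixed, hand-set scale;
    in the paper the transformation is learned end-to-end.
    """
    h, w = len(feature), len(feature[0])
    out_h, out_w = max(1, round(h * scale)), max(1, round(w * scale))
    out = []
    for i in range(out_h):
        # Map the output pixel centre back to input coordinates.
        y = min(max((i + 0.5) / scale - 0.5, 0.0), h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) / scale - 0.5, 0.0), w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            dx = x - x0
            top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
            bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def effective_receptive_field(kernel_size, scale):
    """Extent, in pre-resize pixels, covered by a kernel applied after resizing."""
    return kernel_size / scale


if __name__ == "__main__":
    fmap = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]
    shrunk = affine_resize(fmap, 0.5)
    print(len(shrunk), len(shrunk[0]))
    print(effective_receptive_field(3, 0.5))
```

Under these assumptions, shrinking the map by a factor of 0.5 makes a following 3x3 kernel span roughly 6 pixels of the original map, while inflating it by 2 narrows the span to about 1.5 pixels. This is the sense in which a resizing layer changes the receptive field without altering any kernel.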
