Contextual Convolution Blocks

A fundamental processing layer of modern deep neural networks is the 2D convolution. It applies a filter uniformly across the input, effectively creating feature detectors that are translation invariant. Fully-connected layers, in contrast, are spatially selective, allowing unique detectors at each location of the input. Full connectivity, however, comes at the expense of an enormous number of free parameters to be trained, the associated difficulty of learning without over-fitting, and the loss of spatial coherence. We introduce Contextual Convolution Blocks, a novel method for creating spatially selective feature detectors that remain locally translation invariant. This increases the expressive power of the network beyond standard convolutional layers and allows unique filters to be learned for distinct regions of the input; the filters no longer need to be discriminative in regions unlikely to contain the target features. The method is a generalization of the Squeeze-and-Excitation architecture and introduces minimal extra parameters. We provide experimental results on three datasets and a thorough exploration of how the increased expressiveness is instantiated.
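Since the block is presented as a generalization of Squeeze-and-Excitation, a minimal NumPy sketch of the baseline SE recalibration may help ground the idea: a global "squeeze" summarizes each channel, a small bottleneck "excitation" network produces per-channel gates in (0, 1), and the feature map is rescaled channel-wise. A spatially selective variant would replace the single per-channel gate with a gate that varies across regions of the input. All names and shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(x, w1, b1, w2, b2):
    """Baseline SE recalibration of a feature map x with shape (C, H, W).

    w1/b1 and w2/b2 are the two fully-connected layers of the
    excitation bottleneck (reduction ratio r = C / w1.shape[0]).
    """
    z = x.mean(axis=(1, 2))                 # squeeze: global average pool -> (C,)
    h = np.maximum(w1 @ z + b1, 0.0)        # FC + ReLU, bottleneck of size C // r
    s = sigmoid(w2 @ h + b2)                # FC + sigmoid -> per-channel gates in (0, 1)
    return x * s[:, None, None]             # scale each channel by its gate

# Toy example: 8 channels, 4x4 spatial grid, reduction ratio 2.
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)); b1 = np.zeros(C // r)
w2 = rng.standard_normal((C, C // r)); b2 = np.zeros(C)
y = squeeze_excite(x, w1, b1, w2, b2)
```

Because the gates lie in (0, 1), recalibration can only attenuate channels, never amplify them; the contextual extension described in the abstract would additionally let this attenuation differ between regions, so a filter can be suppressed wherever its target features are unlikely to appear.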
