Top-Down Networks: A coarse-to-fine reimagination of CNNs

Biological vision adopts a coarse-to-fine information processing pathway, from the initial detection and binding of salient features of a visual scene to the enhanced, preferential processing given to relevant stimuli. In contrast, CNNs process fine-to-coarse, moving from local, edge-detecting filters to more global ones that extract abstract representations of the input. In this paper we reverse the feature extraction part of standard bottom-up architectures and turn them upside down: we propose top-down networks. By blurring high-frequency information and restoring it only at later stages, the proposed coarse-to-fine pathway offers a line of defence against adversarial attacks that introduce high-frequency noise. Moreover, since image resolution increases with depth, the final convolutional layer produces high-resolution feature maps, which contributes to the explainability of the network's decision-making process: it favors object-driven decisions over context-driven ones and thus provides better-localized class activation maps. This paper offers empirical evidence for the applicability of top-down resolution processing to various existing architectures on multiple visual tasks.
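
To make the coarse-to-fine idea concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: the class names, channel widths, and the 8x initial downsampling factor are illustrative assumptions. The network starts from a blurred, downsampled view of the image, each stage upsamples its feature maps so resolution grows with depth, resized copies of the input reintroduce higher-frequency detail only at later stages, and the final 1x1 convolution yields high-resolution class activation maps before global average pooling.

```python
# Sketch of a coarse-to-fine (top-down) classifier. Hypothetical names and
# hyperparameters; intended only to illustrate the resolution ordering.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopDownStage(nn.Module):
    """Upsample first, then convolve: spatial resolution grows with depth."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)
        return self.conv(x)


class TopDownNet(nn.Module):
    def __init__(self, num_classes=10, base_ch=64):
        super().__init__()
        # Channel widths shrink as resolution grows (mirror of a standard CNN).
        widths = [base_ch, base_ch // 2, base_ch // 4, base_ch // 4]
        self.stem = nn.Conv2d(3, widths[0], 3, padding=1)
        self.stages = nn.ModuleList(
            [TopDownStage(widths[i], widths[i + 1]) for i in range(3)])
        # 1x1 convs that re-inject the (resized) image at each resolution,
        # restoring high-frequency content only at later stages.
        self.inject = nn.ModuleList(
            [nn.Conv2d(3, widths[i + 1], 1) for i in range(3)])
        self.classifier = nn.Conv2d(widths[-1], num_classes, 1)

    def forward(self, x):
        # Start from a heavily downsampled view: high frequencies are blurred
        # away in the early layers.
        h = self.stem(F.interpolate(x, scale_factor=1 / 8, mode="bilinear",
                                    align_corners=False))
        for stage, inj in zip(self.stages, self.inject):
            h = stage(h)  # resolution doubles
            h = h + inj(F.interpolate(x, size=h.shape[-2:], mode="bilinear",
                                      align_corners=False))
        cam = self.classifier(h)      # high-resolution class activation maps
        return cam.mean(dim=(2, 3))   # global average pooling -> logits


# Example: a 224x224 image is reduced to 28x28, then restored to 224x224
# before the 1x1 classification layer.
logits = TopDownNet()(torch.randn(1, 3, 224, 224))
```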
