Why do deep convolutional networks generalize so poorly to small image transformations?

Convolutional Neural Networks (CNNs) are commonly assumed to be invariant to small image transformations, either because of the convolutional architecture or because they were trained using data augmentation. Recently, several authors have shown that this is not the case: small translations or rescalings of the input image can drastically change the network's prediction. In this paper, we quantify this phenomenon and ask why neither the convolutional architecture nor data augmentation is sufficient to achieve the desired invariance. Specifically, we show that the convolutional architecture does not give invariance because the subsampling operations in modern architectures ignore the classical sampling theorem, and data augmentation does not give invariance because CNNs learn to be invariant to transformations only for images that are very similar to typical images from the training set. We discuss two possible solutions to this problem: (1) antialiasing the intermediate representations and (2) increasing data augmentation, and show that they provide only a partial solution at best. Taken together, our results indicate that the problem of ensuring invariance to small image transformations in neural networks while preserving high accuracy remains unsolved.
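The measurement underlying these claims is straightforward to reproduce. The sketch below, which assumes torchvision's pretrained ResNet-50 and an illustrative crop/shift protocol (not necessarily the exact one used in the paper), slides a 224x224 crop one pixel at a time inside a larger image and records how the predicted class probabilities change; large jumps between adjacent crops indicate a lack of translation invariance.

```python
# Minimal sketch: probe a pretrained CNN's sensitivity to one-pixel shifts.
# Assumes torchvision's pretrained ResNet-50; the crop/shift protocol is
# illustrative only, not necessarily the protocol used in the paper.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def shifted_probs(img: Image.Image, max_shift: int = 8) -> torch.Tensor:
    """Class probabilities for a 224x224 crop shifted horizontally one pixel at a time."""
    x = preprocess(img)                                   # 3 x 256 x 256
    outs = []
    with torch.no_grad():
        for dx in range(max_shift):
            crop = x[:, 16:16 + 224, 16 + dx:16 + dx + 224]
            outs.append(F.softmax(model(crop.unsqueeze(0)), dim=1).squeeze(0))
    return torch.stack(outs)                              # max_shift x 1000

# Usage (hypothetical image path): largest jump in the top class's probability
# between crops that differ by a single pixel.
# probs = shifted_probs(Image.open("dog.jpg").convert("RGB"))
# top = probs[0].argmax()
# print((probs[1:, top] - probs[:-1, top]).abs().max().item())
```

Solution (1), antialiasing the intermediate representations, amounts to low-pass filtering before every subsampling step instead of subsampling with a bare stride. A minimal sketch of such a layer follows; the fixed 3x3 binomial filter is an illustrative choice.

```python
# Sketch of solution (1), anti-aliased subsampling: a fixed low-pass (binomial)
# blur followed by stride-2 subsampling, applied depthwise per channel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        k1d = torch.tensor([1.0, 2.0, 1.0])
        k2d = torch.outer(k1d, k1d)
        k2d = k2d / k2d.sum()                              # normalized low-pass filter
        # One copy of the filter per channel (depthwise convolution).
        self.register_buffer("kernel", k2d.expand(channels, 1, 3, 3).clone())
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Blur, then subsample by 2: attenuating high frequencies before the
        # stride reduces aliasing, so the output changes far less under
        # one-pixel shifts of the input than bare strided subsampling does.
        return F.conv2d(x, self.kernel, stride=2, padding=1, groups=self.channels)
```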
