Convolutional Neural Networks Are Not Invariant to Translation, but They Can Learn to Be

When seeing a new object, humans can immediately recognize it across different retinal locations: the internal object representation is invariant to translation. It is commonly believed that Convolutional Neural Networks (CNNs) are architecturally invariant to translation thanks to the convolution and/or pooling operations they are endowed with. However, several studies have found that these networks systematically fail to recognize new objects at untrained locations. In this work, we test a wide variety of CNN architectures and show that, apart from DenseNet-121, none of the models tested was architecturally invariant to translation. Nevertheless, all of them could learn to be invariant to translation. We show that this can be achieved by pretraining on ImageNet, and sometimes even with much simpler data sets, provided all the items are fully translated across the input canvas. At the same time, this invariance can be disrupted by further training due to catastrophic forgetting/interference. These experiments show how pretraining a network on an environment with the right 'latent' characteristics (a more naturalistic environment) can result in the network learning deep perceptual rules that dramatically improve subsequent generalization.
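To make the kind of test described above concrete, here is a minimal PyTorch sketch (not the paper's actual code; the 56x56 canvas, the small CNN, and the specific displacement are assumptions chosen for illustration). It trains a network with every item pasted at a single canvas location, then compares test accuracy at the trained location versus an untrained one:

```python
# Sketch of a translation-invariance probe (assumptions: 56x56 canvas,
# MNIST digits, a small CNN; this is NOT the paper's exact setup).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

def place_on_canvas(img, top, left, canvas=56):
    """Paste a 28x28 item onto a blank canvas at position (top, left)."""
    out = torch.zeros(img.shape[0], canvas, canvas)
    out[:, top:top + 28, left:left + 28] = img
    return out

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 5)
        self.conv2 = nn.Conv2d(16, 32, 5)
        self.fc = nn.Linear(32 * 11 * 11, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # 56 -> 52 -> 26
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 26 -> 22 -> 11
        return self.fc(x.flatten(1))

def evaluate(model, loader, top, left):
    """Accuracy when every test item is placed at (top, left)."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            x = torch.stack([place_on_canvas(i, top, left) for i in x])
            correct += (model(x).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total

train_set = datasets.MNIST('.', train=True, download=True,
                           transform=transforms.ToTensor())
test_set = datasets.MNIST('.', train=False, download=True,
                          transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256)

model = SmallCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for x, y in train_loader:  # one epoch, items only at the canvas centre
    x = torch.stack([place_on_canvas(i, 14, 14) for i in x])
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    opt.step()

# An architecturally invariant network would score similarly at both
# locations; the claim tested above is that most CNNs do not.
print('trained location:  ', evaluate(model, test_loader, 14, 14))
print('untrained location:', evaluate(model, test_loader, 0, 0))
```

A systematic accuracy gap at the untrained location is the failure the abstract refers to: convolution and pooling alone do not guarantee invariance once a fully connected readout is attached, which is why pretraining with fully translated items matters.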
