How to Represent Part-Whole Hierarchies in a Neural Network

This paper does not describe a working system. Instead, it presents a single idea about representation which allows advances made by several different groups to be combined into an imaginary system called GLOM. The advances include transformers, neural fields, contrastive representation learning, distillation and capsules. GLOM answers the question: How can a neural network with a fixed architecture parse an image into a part-whole hierarchy which has a different structure for each image? The idea is simply to use islands of identical vectors to represent the nodes in the parse tree. If GLOM can be made to work, it should significantly improve the interpretability of the representations produced by transformer-like systems when applied to vision or language.

1 Overview of the idea

There is strong psychological evidence that people parse visual scenes into part-whole hierarchies and model the viewpoint-invariant spatial relationship between a part and a whole as the coordinate transformation between intrinsic coordinate frames that they assign to the part and the whole [Hinton, 1979]. If we want to make neural networks that understand images in the same way as people do, we need to figure out how neural networks can represent part-whole hierarchies. This is difficult because a real neural network cannot dynamically allocate a group of neurons to represent a node in a parse tree.

The inability of neural nets to dynamically allocate neurons was the motivation for a series of models that used "capsules" [Sabour et al., 2017, Hinton et al., 2018, Kosiorek et al., 2019]. These models made the assumption that a group of neurons called a capsule would be permanently dedicated to a part of a particular type occurring in a particular region of the image. A parse tree could then be created by activating a subset of these pre-existing, type-specific capsules and the appropriate connections between them. This paper describes a very different way of using capsules to represent the part-whole hierarchy in a neural net.

Even though this paper is primarily concerned with the perception of a single static image, GLOM is most easily understood as a pipeline for processing a sequence of frames, so a static image will be treated as a sequence of identical frames.

The GLOM architecture is composed of a large number of columns which all use exactly the same weights. Each column is a stack of spatially local autoencoders that learn multiple levels of representation for what is happening in a small image patch. Each autoencoder transforms the embedding at one level into the embedding at an adjacent level using a multilayer bottom-up encoder and a multilayer top-down decoder. These levels correspond to the levels in a part-whole hierarchy. When shown an image of a face, for example, a single column might converge on embedding vectors representing a nostril, a nose, a face, and a person. Figure 1 shows how the embeddings at different levels interact in a single column.
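As a rough illustration of how a single column could be organized, here is a minimal NumPy sketch. The embedding dimension, the number of levels, the one-hidden-layer encoders and decoders, and the simple averaging update are all assumptions made to get a runnable example, not details fixed by the paper; every name in the sketch is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16          # embedding dimension, assumed equal at every level for simplicity
NUM_LEVELS = 5  # e.g. roughly: image patch, nostril, nose, face, person
HIDDEN = 64     # hidden width of each small MLP (hypothetical)

def mlp_params(d_in, d_out):
    """Random parameters for a one-hidden-layer MLP standing in for the paper's
    'multilayer' bottom-up encoders and top-down decoders."""
    return {"W1": rng.normal(0.0, 0.1, (d_in, HIDDEN)),
            "W2": rng.normal(0.0, 0.1, (HIDDEN, d_out))}

def mlp(params, x):
    return np.tanh(x @ params["W1"]) @ params["W2"]

# One bottom-up encoder and one top-down decoder per pair of adjacent levels.
# Every column would share these same weights.
bottom_up = [mlp_params(D, D) for _ in range(NUM_LEVELS - 1)]  # level l-1 -> level l
top_down  = [mlp_params(D, D) for _ in range(NUM_LEVELS - 1)]  # level l+1 -> level l

def column_step(levels):
    """One settling step for a single column: each level moves towards the average of
    its previous state, the bottom-up prediction from the level below, and the
    top-down prediction from the level above (where those neighbours exist).
    The full GLOM update would also mix in the attention-weighted average of the
    same level in nearby columns, sketched further below."""
    new_levels = []
    for l, e in enumerate(levels):
        contributions = [e]
        if l > 0:
            contributions.append(mlp(bottom_up[l - 1], levels[l - 1]))
        if l < len(levels) - 1:
            contributions.append(mlp(top_down[l], levels[l + 1]))
        new_levels.append(np.mean(contributions, axis=0))
    return new_levels

# A static image is treated as a sequence of identical frames, so the update is
# simply iterated until the column settles.
levels = [rng.normal(size=D) for _ in range(NUM_LEVELS)]
for _ in range(10):
    levels = column_step(levels)
```

Averaging the contributions is only one plausible combination rule; the point of the sketch is the structure, a shared stack of level-to-level encoders and decoders in every column, rather than the particular update.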
Figure 1 does not show the interactions between embeddings at the same level in different columns. These are much simpler than the interactions within a column because they do not need to implement part-whole coordinate transforms. They are like the attention-weighted interactions between columns representing different word fragments in a multi-headed transformer [Devlin et al., 2018], but they are simpler because the query, key and value vectors are all identical to the embedding vector. The role of the inter-column interactions is to produce islands of identical embeddings at a level by making each embedding vector at that level regress towards other similar vectors at nearby locations. This creates multiple local "echo chambers" in which embeddings at a level attend mainly to other like-minded embeddings; a minimal sketch of this within-level attention follows the footnotes below.

Footnotes:
1. GLOM is derived from the slang "glom together", which may derive from the word "agglomerate".
2. What neurons do is determined by their incoming and outgoing weights, and real neurons cannot completely change these weights rapidly.
3. The GLOM architecture has some similarity to models that use the errors in top-down predictions as their bottom-up signals [Rao and Ballard, 1999], but in a nonlinear system the bottom-up signals cannot just carry the prediction error, because the full activity vector is required to select the right operating regime for the non-linear units.
4. Each level in a column bears some resemblance to a hypercolumn as described by neuroscientists.
5. An embedding vector is the activity vector of a capsule.
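To make the inter-column interaction concrete: because the query, key and value vectors are all the embedding itself, the within-level update reduces to each embedding regressing towards an attention-weighted average of the embeddings at nearby locations. The following minimal sketch uses assumed settings (a one-dimensional row of columns, a small neighbourhood radius, a softmax temperature and a blending rate, none of which come from the paper).

```python
import numpy as np

def attention_step(embeddings, radius=2, temperature=1.0, rate=0.5):
    """One within-level update across a 1-D row of columns.

    embeddings: array of shape (num_columns, D) holding the embedding at one level
    in each column. Because query = key = value = the embedding itself, the attention
    weights are just softmaxed dot products with the neighbours, and each embedding
    regresses towards the weighted average of similar nearby embeddings."""
    n, _ = embeddings.shape
    new = embeddings.copy()
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        neighbours = embeddings[lo:hi]                        # keys and values
        logits = neighbours @ embeddings[i] / temperature     # query . keys
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        target = weights @ neighbours                         # attention-weighted average
        new[i] = (1 - rate) * embeddings[i] + rate * target   # regress towards it
    return new

# Two noisy groups of columns: repeated updates should pull each group together
# into an island of nearly identical vectors while keeping the two islands distinct.
rng = np.random.default_rng(0)
row = np.vstack([rng.normal(loc=+1.0, scale=0.3, size=(5, 8)),
                 rng.normal(loc=-1.0, scale=0.3, size=(5, 8))])
for _ in range(20):
    row = attention_step(row)
```

The resulting islands of near-identical vectors are what GLOM uses to represent the nodes of the parse tree: every column that belongs to the same island at a given level is asserting membership in the same part or whole.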

References

[1] Kaiming He et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Soroush Abbasi Koohpayegani et al. ISD: Self-Supervised Learning by Iterative Similarity Distillation, 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[3] Gordon Wetzstein et al. Implicit Neural Representations with Periodic Activation Functions, 2020, NeurIPS.

[4] Nitish Srivastava et al. Geometric Capsule Autoencoders for 3D Point Clouds, 2019, ArXiv.

[5] Geoffrey E. Hinton. A Parallel Computation that Assigns Canonical Object-Based Frames of Reference, 1981, IJCAI.

[6] Gordon Wetzstein et al. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations, 2019, NeurIPS.

[7] Rich Caruana et al. Model compression, 2006, KDD '06.

[8] Lukasz Kaiser et al. Attention is All you Need, 2017, NIPS.

[9] Geoffrey E. Hinton. Shape Representation in Parallel Systems, 1981, IJCAI.

[10] Geoffrey E. Hinton et al. Using Fast Weights to Attend to the Recent Past, 2016, NIPS.

[11] Geoffrey E. Hinton et al. Grammar as a Foreign Language, 2014, NIPS.

[12] Francis Crick et al. The function of dream sleep, 1983, Nature.

[13] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[14] Geoffrey E. Hinton et al. Dynamic Routing Between Capsules, 2017, NIPS.

[15] J. Csicsvari et al. Replay and Time Compression of Recurring Spike Sequences in the Hippocampus, 1999, The Journal of Neuroscience.

[16] Richard S. Zemel et al. Lending direction to neural networks, 1995, Neural Networks.

[17] Geoffrey E. Hinton et al. Matrix capsules with EM routing, 2018, ICLR.

[18] Geoffrey E. Hinton. Mapping Part-Whole Hierarchies into Connectionist Networks, 1990, Artificial Intelligence.

[19] Geoffrey E. Hinton et al. SMEM Algorithm for Mixture Models, 1998, Neural Computation.

[20] Li Fei-Fei et al. Learning Physical Graph Representations from Visual Scenes, 2020, NeurIPS.

[21] Mohammad Norouzi et al. Big Self-Supervised Models are Strong Semi-Supervised Learners, 2020, NeurIPS.

[22] Yee Whye Teh et al. Stacked Capsule Autoencoders, 2019, NeurIPS.

[23] Adam Santoro et al. Backpropagation and the brain, 2020, Nature Reviews Neuroscience.

[24] Geoffrey E. Hinton et al. Learning Mixture Models of Spatial Coherence, 1993, Neural Computation.

[25] Christopher K. I. Williams et al. Products of Gaussians and Probabilistic Minor Component Analysis, 2002, Neural Computation.

[26] Xinlei Chen et al. Exploring Simple Siamese Representation Learning, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Geoffrey E. Hinton et al. To appear in: Advances in Neural Information Processing Systems, 2007.

[28] Geoffrey E. Hinton et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, 1998, Learning in Graphical Models.

[29] Miguel Á. Carreira-Perpiñán et al. Multiscale conditional random fields for image labeling, 2004, CVPR 2004.

[30] Geoffrey E. Hinton et al. Learning and relearning in Boltzmann machines, 1986.

[31] Michal Valko et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, 2020, NeurIPS.

[32] Geoffrey E. Hinton et al. A Mobile Robot That Learns Its Place, 1997, Neural Computation.

[33] Oriol Vinyals et al. Representation Learning with Contrastive Predictive Coding, 2018, ArXiv.

[34] Song-Chun Zhu et al. Mapping Natural Image Patches by Explicit and Implicit Manifolds, 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[35] Yee Whye Teh et al. Set Transformer, 2018, ICML.

[36] Geoffrey E. Hinton et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.

[37] Saeed Saremi et al. Hierarchical model of natural images and the origin of scale invariance, 2013, Proceedings of the National Academy of Sciences.

[38] Geoffrey E. Hinton et al. Learning Distributed Representations of Concepts Using Linear Relational Embedding, 2001, IEEE Transactions on Knowledge and Data Engineering.

[39] David J. Fleet et al. Unsupervised part representation by Flow Capsules, 2020, ICML.

[40] Donald Geman et al. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41] R. Devon Hjelm et al. Learning Representations by Maximizing Mutual Information Across Views, 2019, NeurIPS.

[42] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence, 2002, Neural Computation.

[43] Georg Heigold et al. Object-Centric Learning with Slot Attention, 2020, NeurIPS.

[44] Paul A. Viola et al. Robust Real-Time Face Detection, 2001, International Journal of Computer Vision.

[45] Albert K. Lee et al. Memory of Sequential Experience in the Hippocampus during Slow Wave Sleep, 2002, Neuron.

[46] Pratul P. Srinivasan et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, 2020, ECCV.

[47] Geoffrey E. Hinton et al. Modeling image patches with a directed hierarchy of Markov random fields, 2007, NIPS.

[48] Andrea Tagliasacchi et al. NASA: Neural Articulated Shape Approximation, 2020, ECCV.

[49] Andreas Geiger et al. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Geoffrey E. Hinton et al. Canonical Capsules: Unsupervised Capsules in Canonical Pose, 2020, ArXiv.

[51] Geoffrey E. Hinton et al. Modeling Human Motion Using Binary Latent Variables, 2006, NIPS.

[52] Geoffrey E. Hinton et al. Transforming Auto-Encoders, 2011, ICANN.

[53] Geoffrey E. Hinton et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.

[54] David Ha et al. Generating Large Images from Latent Vectors, 2016.

[55] Paul Barham et al. Machine Learning Systems are Stuck in a Rut, 2019, HotOS.

[56] Georg Heigold et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.

[57] Geoffrey E. Hinton et al. Implementing Semantic Networks in Parallel Hardware, 2014.

[58] Weiwei Sun et al. Attentive Context Normalization for Robust Permutation-Equivariant Learning, 2019, ArXiv.

[59] Allan Jabri et al. Space-Time Correspondence as a Contrastive Random Walk, 2020, NeurIPS.

[60] Geoffrey E. Hinton. Some Demonstrations of the Effects of Structural Descriptions in Mental Imagery, 1979, Cognitive Science.

[61] Rajesh P. N. Rao et al. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects, 1999, Nature Neuroscience.