Transforming Autoencoders

One way to design an object recognition system is to define objects recursively in terms of their parts and the required spatial relationships between the parts and the whole. A natural way for a neural network to implement this knowledge is by using a matrix of weights to represent each part-whole relationship and a vector of neural activities to represent the pose of each part or whole relative to the viewer [10]. This leads to neural networks that can recognize objects over a wide range of viewpoints using neural activities that are “equivariant” rather than invariant: as the viewpoint varies the neural activities all vary even though the knowledge in the weights is viewpoint-invariant. The “capsules” that implement the lowest-level parts in the shape hierarchy need to extract explicit pose parameters from pixel intensities. This paper shows that these capsules are quite easy to learn from pairs of transformed images if the neural net has direct, non-visual access to the transformations.
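The architecture implied by the last two sentences can be sketched as follows. This is a minimal, untrained forward pass only, with many illustrative assumptions: invented layer sizes, a pose restricted to a 2-D translation (x, y), single-layer recognition and generation nets, and random weights. The capsule infers a pose and a presence probability from the pixels, the externally supplied shift (dx, dy) is added to the pose, and the capsule's gated pixel contribution is regenerated from the transformed pose; training (not shown) would minimize the error between the summed output and the actual shifted image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes -- not taken from the paper's experiments.
N_PIX = 100   # flattened 10x10 input image
N_REC = 20    # recognition units per capsule
N_GEN = 20    # generation units per capsule
N_CAPS = 3    # number of capsules

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Capsule:
    """One capsule of a transforming autoencoder (forward pass only)."""
    def __init__(self):
        # Recognition weights: pixels -> hidden recognition units.
        self.W_rec = rng.normal(0, 0.1, (N_REC, N_PIX))
        # Readout heads for the pose (x, y) and the presence probability p.
        self.w_x = rng.normal(0, 0.1, N_REC)
        self.w_y = rng.normal(0, 0.1, N_REC)
        self.w_p = rng.normal(0, 0.1, N_REC)
        # Generation weights: transformed pose -> hidden -> pixels.
        self.W_gen_in = rng.normal(0, 0.1, (N_GEN, 2))
        self.W_gen_out = rng.normal(0, 0.1, (N_PIX, N_GEN))

    def forward(self, image, dx, dy):
        h = sigmoid(self.W_rec @ image)      # recognition units
        x = self.w_x @ h                     # inferred pose: x coordinate
        y = self.w_y @ h                     # inferred pose: y coordinate
        p = sigmoid(self.w_p @ h)            # probability the part is present
        # Apply the externally supplied (non-visual) transformation to the
        # pose, then regenerate this capsule's pixel contribution, gated by p.
        g = sigmoid(self.W_gen_in @ np.array([x + dx, y + dy]))
        return p * (self.W_gen_out @ g)

def transforming_autoencoder(image, dx, dy, capsules):
    # The predicted transformed image is the sum of all capsule contributions.
    return sum(c.forward(image, dx, dy) for c in capsules)

capsules = [Capsule() for _ in range(N_CAPS)]
image = rng.random(N_PIX)
output = transforming_autoencoder(image, 1.0, -2.0, capsules)
```

A training loop would present pairs (image, shifted image) together with the shift (dx, dy) and backpropagate the reconstruction error, so each capsule is forced to make its x and y outputs behave like genuine coordinates: the only way to predict the shifted image from pose plus shift is for the pose to be equivariant with the transformation.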