Convolutional Neural Networks have been the go-to option for object recognition in computer vision over the last few years. However, their invariance to object translations is still considered a weak point: it is obtained only through max-pooling layers and remains limited to small translations. One bio-inspired approach overcomes this limitation by considering the What/Where pathway separation found in mammals. This approach acts as a nature-inspired attention mechanism; another classical attention mechanism is the Spatial Transformer, which allows adaptive, end-to-end learning of different classes of spatial transformations during training. In this work, we review Spatial Transformers as an attention-only mechanism and compare them with the What/Where model. We show that attention-restricted, or "Foveated", Spatial Transformer Networks, combined with a curriculum learning training scheme and an efficient log-polar visual input, outperform the What/Where model, without requiring any extra supervision.
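
To make the attention-restricted idea concrete, the following is a minimal sketch of a "foveated" Spatial Transformer module written with PyTorch. It assumes a localization network that predicts only a translation and an isotropic zoom (no rotation or shear); the class name, layer sizes, and parametrization are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionOnlySTN(nn.Module):
    """Illustrative sketch: a Spatial Transformer restricted to attention-like
    transforms (translation + isotropic zoom), so the sampled glimpse acts as
    a learnable fovea. Not the paper's implementation."""

    def __init__(self, in_channels=1):
        super().__init__()
        # Localization ("where") network: predicts 3 pose parameters (tx, ty, log-zoom).
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(32), nn.ReLU(),
            nn.Linear(32, 3),
        )
        # Initialize the last layer to zero so training starts from the identity
        # transform (no shift, zoom = 1).
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.zero_()

    def forward(self, x):
        tx, ty, log_zoom = self.loc(x).unbind(dim=1)
        zoom = torch.exp(log_zoom)
        # Build a restricted affine matrix: isotropic scaling plus translation only,
        # which keeps the transformer "attention-only".
        theta = torch.zeros(x.size(0), 2, 3, device=x.device)
        theta[:, 0, 0] = zoom
        theta[:, 1, 1] = zoom
        theta[:, 0, 2] = tx
        theta[:, 1, 2] = ty
        # Differentiable sampling of the attended glimpse.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

In a full model, the resampled glimpse produced by such a module would be passed to a downstream recognition ("what") network, while the localization network plays the role of the "where" attention; everything is trained end-to-end from the classification loss alone, i.e. with no extra supervision of the attended location.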