Learning Sparse and Invariant Feature Hierarchies

Understanding how the visual cortex builds invariant representations is one of the most challenging problems in visual neuroscience. The feed-forward, multi-stage Hubel and Wiesel architecture [1, 2, 3, 4, 5] stacks multiple levels of alternating layers: simple cells that perform feature extraction, and complex cells that pool together features of a given type within a local receptive field. These computational models have been successfully applied to handwriting recognition [1, 2] and generic object recognition [4, 5]. In existing models, feature learning consists either of handcrafting the first layers and training the upper layers by recording templates from the training set, which leads to inefficient representations [4, 5], or of training the entire architecture with supervision, which requires large labeled training sets [2, 3]. We propose a fully unsupervised algorithm for learning sparse and locally invariant features at all levels. Each simple-cell layer is composed of multiple convolution filters, followed by a winner-take-all competition within a local area and a sigmoid non-linearity. For training, each simple-cell layer is coupled with a feed-back layer whose role is to reconstruct the input of the simple-cell layer from its output; the coupled layers are trained simultaneously to minimize the average reconstruction error. The output of a simple-cell layer can be seen as a sparse overcomplete representation of its input. The complex cells sum the simple-cell activities of one filter over the area in which the winner-take-all operation is performed, yielding representations that are invariant to small displacements of the input stimulus. The training procedure is similar to [6], but the local winner-take-all competition ensures that the representation is spatially sparse (and the complex-cell representation locally invariant). The next stage of simple-cell and complex-cell layers is trained in an identical fashion on the outputs of the first layer of complex cells [7], resulting in higher-level, more invariant representations that are then fed to a supervised classifier. This procedure yields 0.64% error on the MNIST dataset (handwritten digits) and a 54% average recognition rate on the Caltech-101 dataset (101 object categories, 30 training samples per category), demonstrating good performance even with few labeled training samples.
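To make the coupled encoder/decoder stage concrete, the following is a minimal PyTorch sketch of one simple-cell layer with local winner-take-all competition, its feed-back (reconstruction) layer, and the complex-cell sum pooling. It is an illustration under assumptions, not the paper's actual implementation: the filter count, kernel size, 2x2 winner-take-all window, and optimizer settings are hypothetical, and the sigmoid is applied before the winner-take-all here so that non-winning units are exactly zero.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleComplexStage(nn.Module):
    # One coupled stage: convolutional simple cells with local winner-take-all
    # and a sigmoid, a feed-back decoder that reconstructs the stage's input,
    # and complex cells that sum each filter's activity over the WTA area.
    def __init__(self, in_channels=1, n_filters=16, kernel=5, wta=2):
        super().__init__()
        self.wta = wta
        self.encoder = nn.Conv2d(in_channels, n_filters, kernel)
        self.decoder = nn.ConvTranspose2d(n_filters, in_channels, kernel)

    def forward(self, x):
        z = torch.sigmoid(self.encoder(x))      # simple-cell responses in (0, 1)
        # Winner-take-all: within each wta x wta area of each filter map, keep
        # only the strongest response and zero the rest (spatial sparsity).
        winners, idx = F.max_pool2d(z, self.wta, return_indices=True)
        sparse = F.max_unpool2d(winners, idx, self.wta, output_size=z.shape[-2:])
        # Complex cells: summing the sparse map over the WTA area yields a code
        # that is invariant to small displacements within each area.
        complex_code = self.wta ** 2 * F.avg_pool2d(sparse, self.wta)
        recon = self.decoder(sparse)            # feed-back reconstruction
        return sparse, complex_code, recon

# Unsupervised training of one stage: minimize average reconstruction error.
stage = SimpleComplexStage()
opt = torch.optim.SGD(stage.parameters(), lr=0.01)
x = torch.randn(8, 1, 28, 28)                   # stand-in MNIST-sized batch
sparse, complex_code, recon = stage(x)
loss = F.mse_loss(recon, x)
opt.zero_grad()
loss.backward()
opt.step()
# A second stage would be trained the same way on complex_code, and a
# supervised classifier fitted on the top-level complex-cell outputs.

Note that with a single winner per window, summing the sparse map over the winner-take-all area simply recovers the winning responses; the sum pooling is written out to mirror the abstract's description of the complex cells.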

[1] J. Mutch and D. G. Lowe. Multiclass Object Recognition with Sparse, Localized Features. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006.

[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.