Learning Sparse and Invariant Feature Hierarchies

Understanding how the visual cortex builds invariant representations is one of the most challenging problems in visual neuroscience. The feed-forward, multi-stage Hubel and Wiesel architecture [1, 2, 3, 4, 5] stacks multiple levels of alternating layers: simple cells that perform feature extraction, and complex cells that pool together features of a given type within a local receptive field. These computational models have been successfully applied to handwriting recognition [1, 2] and generic object recognition [4, 5]. In existing models, feature learning consists either of handcrafting the first layers and training the upper layers by recording templates from the training set, which leads to inefficient representations [4, 5], or of training the entire architecture with supervision, which requires large labeled training sets [2, 3]. We propose a fully unsupervised algorithm for learning sparse and locally invariant features at all levels. Each simple-cell layer is composed of multiple convolution filters, followed by a winner-take-all competition within a local area and a sigmoid non-linearity. For training, each simple-cell layer is coupled with a feed-back layer whose role is to reconstruct the input of the simple-cell layer from its output; the coupled layers are trained simultaneously to minimize the average reconstruction error. The output of a simple-cell layer can be seen as a sparse overcomplete representation of its input. The complex cells sum the simple-cell activities of one filter over the area in which the winner-take-all operation is performed, yielding representations that are invariant to small displacements of the input stimulus. The training procedure is similar to [6], but the local winner-take-all competition ensures that the representation is spatially sparse (and the complex-cell representation locally invariant). The next stage of simple-cell and complex-cell layers is trained in an identical fashion on the outputs of the first layer of complex cells [7], resulting in higher-level, more invariant representations that are then fed to a supervised classifier. This procedure yields 0.64% error on the MNIST dataset (handwritten digits) and a 54% average recognition rate on the Caltech-101 dataset (101 object categories, 30 training samples per category), demonstrating good performance even with few labeled training samples.
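To make the coupled encoder/decoder stage concrete, the following is a minimal PyTorch sketch of one simple-cell layer with local winner-take-all competition, its feed-back (reconstruction) layer, and the complex-cell sum pooling. It is an illustration under assumptions, not the paper's actual implementation: the filter count, kernel size, 2x2 winner-take-all window, and optimizer settings are hypothetical, and the sigmoid is applied before the winner-take-all here so that non-winning units are exactly zero.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleComplexStage(nn.Module):
    # One coupled stage: convolutional simple cells with local winner-take-all
    # and a sigmoid, a feed-back decoder that reconstructs the stage's input,
    # and complex cells that sum each filter's activity over the WTA area.
    def __init__(self, in_channels=1, n_filters=16, kernel=5, wta=2):
        super().__init__()
        self.wta = wta
        self.encoder = nn.Conv2d(in_channels, n_filters, kernel)
        self.decoder = nn.ConvTranspose2d(n_filters, in_channels, kernel)

    def forward(self, x):
        z = torch.sigmoid(self.encoder(x))      # simple-cell responses in (0, 1)
        # Winner-take-all: within each wta x wta area of each filter map, keep
        # only the strongest response and zero the rest (spatial sparsity).
        winners, idx = F.max_pool2d(z, self.wta, return_indices=True)
        sparse = F.max_unpool2d(winners, idx, self.wta, output_size=z.shape[-2:])
        # Complex cells: summing the sparse map over the WTA area yields a code
        # that is invariant to small displacements within each area.
        complex_code = self.wta ** 2 * F.avg_pool2d(sparse, self.wta)
        recon = self.decoder(sparse)            # feed-back reconstruction
        return sparse, complex_code, recon

# Unsupervised training of one stage: minimize average reconstruction error.
stage = SimpleComplexStage()
opt = torch.optim.SGD(stage.parameters(), lr=0.01)
x = torch.randn(8, 1, 28, 28)                   # stand-in MNIST-sized batch
sparse, complex_code, recon = stage(x)
loss = F.mse_loss(recon, x)
opt.zero_grad()
loss.backward()
opt.step()
# A second stage would be trained the same way on complex_code, and a
# supervised classifier fitted on the top-level complex-cell outputs.

Note that with a single winner per window, summing the sparse map over the winner-take-all area simply recovers the winning responses; the sum pooling is written out to mirror the abstract's description of the complex cells.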

[1] J. Mutch and D. G. Lowe. Multiclass Object Recognition with Sparse, Localized Features. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006.

[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.