Purely supervised convolutional networks yield excellent accuracy on image recognition tasks when data is plentiful [1]. But until now, they have not produced state-of-the-art accuracy on object recognition benchmarks for which few labeled samples per category are available. For example, on the popular Caltech-101 dataset, with 30 samples for each of the 101 categories, methods that use hand-designed features such as SIFT and Geometric Blur combined with a kernel classifier achieve accuracies of 66.2% [5] and 64.6% [6]. By contrast, a purely supervised convolutional network with standard sigmoid non-linearities yields only 26%. This abstract describes a modified ConvNet architecture with a new unsupervised/supervised training procedure that reaches 67.2% accuracy on Caltech-101.

This work explores several architectural designs and training methods and studies their effect on object recognition accuracy. The convolutional network under consideration takes a 143x143 grayscale image as input. Preprocessing consists of removing the mean and performing a local contrast normalization (dividing each pixel by the standard deviation of its neighbors). The first stage has 64 filters of size 9x9, followed by a subsampling layer with a 5x5 stride and a 10x10 averaging window. The second stage has 256 feature maps, each with 16 filters connected to a random subset of first-layer feature maps. Its subsampling layer has a 4x4 stride and a 6x6 averaging window. Hence, the input to the last layer consists of 256 feature maps of size 4x4 (4096 dimensions). Figure 1 shows the outline of a convolutional net, and Figure 2 shows the best sequence of transformations at each stage of the network.

The results are shown in the table. The most surprising result is that simply adding an absolute value after the hyperbolic tangent (tanh) non-linearity practically doubles the recognition rate, from 26% to 58%, with purely supervised training. We conjecture that the advantage of a rectifying non-linearity is to remove redundant information (the polarity of features) and, at the same time, to avoid cancellations of neighboring opposite filter responses in the subsampling layers. Adding a local contrast normalization step after each feature extraction layer [4] further improves the accuracy to 60%.

The second interesting result is that pre-training each stage one after the other using a new unsupervised method, and then adjusting the resulting network with supervised gradient descent, bumps up the accuracy to 67.2%. The procedure is reminiscent of several recent proposals for "deep learning" [2, 3]. Our layer-wise unsupervised training method is called Predictive Sparse Decomposition (PSD). It consists in learning an overcomplete set of basis functions from which the input can be reconstructed, while simultaneously training a feed-forward predictor to approximate the corresponding sparse code.
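To make the architecture above concrete, the following is a minimal PyTorch sketch of the feed-forward pipeline (filter bank, tanh, absolute value rectification, local contrast normalization, average pooling, repeated over two stages). It is an illustration only, not the authors' implementation: the 9x9 second-stage filter size, the dense (rather than random-subset) stage-to-stage connectivity, and the size of the contrast-normalization neighborhood are assumptions, and the PSD pre-training step is not shown. With these choices the spatial sizes work out to the 256 maps of 4x4 (4096 dimensions) stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalContrastNorm(nn.Module):
    """Divisive normalization: divide each pixel by the standard deviation of
    its neighbors. The 9x9 uniform neighborhood is an assumption; the text
    does not specify the window."""

    def __init__(self, channels, size=9, eps=1e-4):
        super().__init__()
        self.channels = channels
        self.size = size
        self.eps = eps
        # One averaging kernel per channel (depthwise convolution).
        self.register_buffer(
            "kernel", torch.ones(channels, 1, size, size) / (size * size)
        )

    def forward(self, x):
        pad = self.size // 2
        mean = F.conv2d(x, self.kernel, padding=pad, groups=self.channels)
        var = F.conv2d((x - mean) ** 2, self.kernel, padding=pad, groups=self.channels)
        return (x - mean) / (var.sqrt() + self.eps)


class TwoStageConvNet(nn.Module):
    """Sketch of the two-stage network described in the text:
    conv -> tanh -> abs -> local contrast norm -> average pooling, twice,
    followed by a linear classifier on the 4096-dimensional output."""

    def __init__(self, num_classes=101):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, kernel_size=9)          # 143 -> 135
        self.norm1 = LocalContrastNorm(64)
        self.pool1 = nn.AvgPool2d(kernel_size=10, stride=5)   # 135 -> 26
        # The text connects each second-stage map to a random subset of 16
        # first-stage maps; a dense 64->256 convolution is used here for simplicity.
        self.conv2 = nn.Conv2d(64, 256, kernel_size=9)         # 26 -> 18
        self.norm2 = LocalContrastNorm(256)
        self.pool2 = nn.AvgPool2d(kernel_size=6, stride=4)     # 18 -> 4
        self.classifier = nn.Linear(256 * 4 * 4, num_classes)  # 4096 -> 101

    def forward(self, x):
        x = self.pool1(self.norm1(torch.abs(torch.tanh(self.conv1(x)))))
        x = self.pool2(self.norm2(torch.abs(torch.tanh(self.conv2(x)))))
        return self.classifier(x.flatten(1))


if __name__ == "__main__":
    # A 143x143 grayscale input produces a 101-way class score vector.
    print(TwoStageConvNet()(torch.randn(1, 1, 143, 143)).shape)  # torch.Size([1, 101])
```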
[1] Geoffrey E. Hinton, et al. Reducing the Dimensionality of Data with Neural Networks. Science, 2006.
[2] Marc'Aurelio Ranzato, et al. Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition. ArXiv, 2010.
[3] Nicolas Pinto, et al. Why is Real-World Visual Object Recognition Hard? PLoS Comput. Biol., 2008.
[4] Cordelia Schmid, et al. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006.
[5] Jitendra Malik, et al. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006.
[6] Marc'Aurelio Ranzato, et al. Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition. 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[7] Yoshua Bengio, et al. Gradient-based learning applied to document recognition. Proc. IEEE, 1998.