Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

We present in this paper a novel approach for training deterministic auto-encoders. We show that by adding a well-chosen penalty term to the classical reconstruction cost function, we can achieve results that equal or surpass those attained by other regularized auto-encoders as well as denoising auto-encoders on a range of datasets. This penalty term corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. We show that this penalty term results in a localized space contraction which in turn yields robust features on the activation layer. Furthermore, we show how this penalty term is related to both regularized auto-encoders and denoising auto-encoders and how it can be seen as a link between deterministic and non-deterministic auto-encoders. We find empirically that this penalty helps to carve a representation that better captures the local directions of variation dictated by the data, corresponding to a lower-dimensional non-linear manifold, while being more invariant to the vast majority of directions orthogonal to the manifold. Finally, we show that by using the learned features to initialize an MLP, we achieve state-of-the-art classification error on a range of datasets, surpassing other methods of pretraining.
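
As a sketch of the training criterion described above, written in our own notation (the abstract does not spell these symbols out): with f the encoder, g the decoder, h(x) = f(x) the encoder activations, L the reconstruction loss, D_n the training set, and λ a trade-off hyper-parameter weighting the contractive penalty, the objective adds the squared Frobenius norm of the encoder's Jacobian to the reconstruction cost:

$$
\mathcal{J}_{\mathrm{CAE}}(\theta) \;=\; \sum_{x \in D_n} \Big( L\big(x,\, g(f(x))\big) \;+\; \lambda\, \|J_f(x)\|_F^2 \Big),
\qquad
\|J_f(x)\|_F^2 \;=\; \sum_{i,j} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2 .
$$

Assuming a sigmoid encoder h = s(Wx + b), each Jacobian row is h_j(1 - h_j) W_j, so the penalty reduces to \(\sum_j \big(h_j(1-h_j)\big)^2 \sum_i W_{ji}^2\) and can be computed at a cost comparable to that of the reconstruction term itself.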
