The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training

Whereas theoretical work suggests that deep architectures might be more efficient at representing highly-varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pre-training. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive effect of pre-training in terms of optimization, and its role as a regularizer. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.
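
The pre-training procedure referred to above is greedy layer-wise unsupervised learning of each layer, followed by a supervised phase initialized from the pre-trained weights. Below is a minimal sketch, assuming simple autoencoder layers in place of the paper's specific models; the layer sizes, learning rates, and toy data are illustrative assumptions, and for brevity the supervised phase trains only a logistic output unit on the learned features rather than fine-tuning the whole stack.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(X, n_hidden, lr=0.5, epochs=200):
    """Unsupervised phase for one layer: fit an autoencoder that reconstructs
    X through a sigmoid hidden layer, then keep only the encoder (W1, b1)."""
    n, n_in = X.shape
    W1 = rng.normal(0.0, 0.1, size=(n_in, n_hidden))   # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, n_in))    # decoder weights
    b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)           # hidden code
        R = H @ W2 + b2                    # linear reconstruction
        dR = (R - X) / n                   # gradient of mean squared error
        dH = (dR @ W2.T) * H * (1.0 - H)   # backprop through the hidden layer
        W2 -= lr * H.T @ dR; b2 -= lr * dR.sum(axis=0)
        W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)
    return W1, b1

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pre-training: each new layer is trained to model the
    representation produced by the already-trained layers below it."""
    params, rep = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(rep, n_hidden)
        params.append((W, b))
        rep = sigmoid(rep @ W + b)         # input for the next layer
    return params

def encode(X, params):
    for W, b in params:
        X = sigmoid(X @ W + b)
    return X

# Toy data (purely illustrative): label says whether the first half of the
# input features outweighs the second half.
X = rng.random((200, 20))
y = (X[:, :10].sum(axis=1) > X[:, 10:].sum(axis=1)).astype(float)

params = pretrain_stack(X, layer_sizes=[16, 8])      # unsupervised phase

# Supervised phase: here only a logistic output unit is trained on the
# pre-trained features; the setting studied in the paper fine-tunes the whole
# stack by backpropagation starting from this initialization.
F = encode(X, params)
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(500):
    p = sigmoid(F @ w + b)
    g = (p - y) / len(y)                   # cross-entropy gradient wrt logits
    w -= F.T @ g
    b -= g.sum()
print("training accuracy:", float(((p > 0.5) == y).mean()))
```

The sketch is only meant to make the two-phase procedure concrete: the unsupervised phase never sees the labels, and its sole product is the initialization of the deep encoder that the supervised phase then builds on.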
