A tutorial on stochastic approximation algorithms for training Restricted Boltzmann Machines and Deep Belief Nets

In this study, we provide a direct comparison of the Stochastic Maximum Likelihood algorithm and Contrastive Divergence for training Restricted Boltzmann Machines on the MNIST data set. We demonstrate that Stochastic Maximum Likelihood is superior when the Restricted Boltzmann Machine is used as a classifier, and that the algorithm can be greatly improved using the technique of iterate averaging from the field of stochastic approximation. We further show that training with parameters that are optimal for classification does not necessarily yield optimal results when Restricted Boltzmann Machines are stacked to form a Deep Belief Network. In our experiments, we observe that fine-tuning a Deep Belief Network significantly changes the distribution of the latent data, even though the parameter changes are negligible.
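To make the comparison concrete, the sketch below (an illustration under stated assumptions, not the paper's actual code) implements both gradient estimators for a binary Restricted Boltzmann Machine in NumPy: Contrastive Divergence (CD-k), which restarts the Gibbs chain at the data every update, and Stochastic Maximum Likelihood (also known as Persistent Contrastive Divergence), which keeps a persistent chain of fantasy particles across updates. It also applies the simple Polyak-Ruppert form of iterate averaging, returning the running mean of the weight iterates. All names (`RBM`, `train`, `log_lik_grad`) and hyperparameters (learning rate, batch size, number of steps) are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary-binary RBM with energy E(v, h) = -v'Wh - b'v - c'h."""

    def __init__(self, n_vis, n_hid):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b = np.zeros(n_vis)   # visible biases
        self.c = np.zeros(n_hid)   # hidden biases

    def p_h(self, v):
        """P(h_j = 1 | v) for each hidden unit."""
        return sigmoid(v @ self.W + self.c)

    def p_v(self, h):
        """P(v_i = 1 | h) for each visible unit."""
        return sigmoid(h @ self.W.T + self.b)

    def gibbs(self, v, k):
        """Run k full Gibbs sweeps starting from visible states v."""
        for _ in range(k):
            h = (rng.random((len(v), self.c.size)) < self.p_h(v)).astype(float)
            v = (rng.random(v.shape) < self.p_v(h)).astype(float)
        return v

def log_lik_grad(rbm, v_pos, v_neg):
    """Stochastic log-likelihood gradient: data statistics minus model statistics."""
    ph_pos, ph_neg = rbm.p_h(v_pos), rbm.p_h(v_neg)
    dW = v_pos.T @ ph_pos / len(v_pos) - v_neg.T @ ph_neg / len(v_neg)
    db = v_pos.mean(axis=0) - v_neg.mean(axis=0)
    dc = ph_pos.mean(axis=0) - ph_neg.mean(axis=0)
    return dW, db, dc

def train(rbm, data, n_steps=2000, batch=64, lr=0.05, k=1, persistent=True):
    """SML/PCD when persistent=True, CD-k when persistent=False.

    Returns the Polyak-Ruppert running average of the weight iterates
    (averaged biases are omitted for brevity).
    """
    chain = data[rng.choice(len(data), batch)]   # initial fantasy particles
    W_avg = rbm.W.copy()
    for t in range(1, n_steps + 1):
        v_pos = data[rng.choice(len(data), batch)]
        v_neg = rbm.gibbs(chain if persistent else v_pos, k)
        if persistent:
            chain = v_neg                        # the chain survives this update
        dW, db, dc = log_lik_grad(rbm, v_pos, v_neg)
        rbm.W += lr * dW
        rbm.b += lr * db
        rbm.c += lr * dc
        W_avg += (rbm.W - W_avg) / t             # running mean of the iterates
    return W_avg

# Toy usage on synthetic binary data; for MNIST one would binarize the images.
data = (rng.random((512, 784)) < 0.1).astype(float)
rbm = RBM(n_vis=784, n_hid=64)
W_avg = train(rbm, data, persistent=True)        # SML with iterate averaging
```

In the stochastic-approximation framing, the averaged weights `W_avg`, rather than the final iterate `rbm.W`, would be used at evaluation time: Polyak and Ruppert showed that averaging the iterates recovers the optimal asymptotic convergence rate under suitably decaying step sizes. The constant step size used above is a simplification for the sketch.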
