An Efficient Learning Procedure for Deep Boltzmann Machines

We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent statistics are estimated using a variational approximation that tends to focus on a single mode, and data-independent statistics are estimated using persistent Markov chains. The use of two quite different techniques for estimating the two types of statistics that enter into the gradient of the log likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer pretraining phase that initializes the weights sensibly. The pretraining also allows the variational inference to be initialized well with a single bottom-up pass. We present results on the MNIST and NORB data sets showing that deep Boltzmann machines learn very good generative models of handwritten digits and 3D objects. We also show that the features discovered by deep Boltzmann machines are a very effective way to initialize the hidden layers of feedforward neural nets, which are then discriminatively fine-tuned.
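The two statistics referred to above are the expected pairwise products of connected units under the data distribution and under the model distribution: for a weight matrix $W$ coupling two layers of a Boltzmann machine, the log-likelihood gradient has the standard form

$$\frac{\partial \log p(\mathbf{v})}{\partial W} = \mathbb{E}_{P_{\text{data}}}\!\left[\mathbf{v}\mathbf{h}^{\top}\right] - \mathbb{E}_{P_{\text{model}}}\!\left[\mathbf{v}\mathbf{h}^{\top}\right].$$

The sketch below illustrates one learning step for a two-layer deep Boltzmann machine under this scheme: mean-field fixed-point updates supply the data-dependent (positive-phase) statistics, and persistent Gibbs chains supply the data-independent (negative-phase) statistics. It is a minimal illustration under simplifying assumptions, not the paper's implementation; the function name `dbm_update`, the binary-unit assumption, the omitted bias terms, and the step counts are all illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_update(v_data, W1, W2, chains_v, chains_h2,
               lr=0.005, mf_steps=10, gibbs_steps=5):
    """One learning step for a two-layer binary DBM (sketch, biases omitted).

    v_data:    (n, D) minibatch of visible vectors
    W1, W2:    (D, H1) and (H1, H2) weight matrices
    chains_v:  (m, D)  persistent negative-phase visible states
    chains_h2: (m, H2) persistent negative-phase top-layer states
    """
    # Positive phase: mean-field fixed-point updates with the data clamped.
    mu1 = sigmoid(v_data @ W1)          # bottom-up initialization of E[h1]
    mu2 = sigmoid(mu1 @ W2)             # bottom-up initialization of E[h2]
    for _ in range(mf_steps):
        mu1 = sigmoid(v_data @ W1 + mu2 @ W2.T)  # h1 sees v and h2
        mu2 = sigmoid(mu1 @ W2)                  # h2 sees h1 only

    # Negative phase: advance the persistent Gibbs chains a few steps.
    for _ in range(gibbs_steps):
        p_h1 = sigmoid(chains_v @ W1 + chains_h2 @ W2.T)
        chains_h1 = (np.random.rand(*p_h1.shape) < p_h1).astype(float)
        p_h2 = sigmoid(chains_h1 @ W2)
        chains_h2 = (np.random.rand(*p_h2.shape) < p_h2).astype(float)
        p_v = sigmoid(chains_h1 @ W1.T)
        chains_v = (np.random.rand(*p_v.shape) < p_v).astype(float)

    # Gradient step: data-dependent minus data-independent statistics.
    n, m = v_data.shape[0], chains_v.shape[0]
    W1 += lr * (v_data.T @ mu1 / n - chains_v.T @ chains_h1 / m)
    W2 += lr * (mu1.T @ mu2 / n - chains_h1.T @ chains_h2 / m)
    return W1, W2, chains_v, chains_h2
```

Keeping the negative-phase chains persistent across parameter updates means the slowly changing weights act on chains that are already close to the model distribution, which is what makes the data-independent estimates cheap enough to be practical at this scale.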
