Training restricted Boltzmann machines: An introduction

Restricted Boltzmann machines (RBMs) are probabilistic graphical models that can be interpreted as stochastic neural networks. They have attracted much attention as building blocks for the multi-layer learning systems called deep belief networks, and variants and extensions of RBMs have found application in a wide range of pattern recognition tasks. This tutorial introduces RBMs from the viewpoint of Markov random fields, starting with the required concepts of undirected graphical models. Different learning algorithms for RBMs, including contrastive divergence learning and parallel tempering, are discussed. Since sampling from RBMs, and therefore also most of their learning algorithms, relies on Markov chain Monte Carlo (MCMC) methods, an introduction to Markov chains and MCMC techniques is provided. Experiments demonstrate relevant aspects of RBM training.

Highlights

- We review the state of the art in training restricted Boltzmann machines (RBMs) from the perspective of graphical models.
- Variants and extensions of RBMs are used in a wide range of pattern recognition tasks.
- The required background on graphical models and Markov chain Monte Carlo methods is provided.
- Theoretical and experimental results are presented.
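The two learning algorithms named above can be made concrete with short sketches. Contrastive divergence (CD-k) approximates the log-likelihood gradient of an RBM by running only k steps of block Gibbs sampling starting from the training data, rather than sampling from the model's stationary distribution until convergence. The following NumPy sketch shows a CD-k update for a binary RBM; it is a minimal illustration under assumed conventions (weight matrix W of shape visible x hidden; names such as cd_k_update are ours), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, b, c, v0, k=1, lr=0.1):
    """One CD-k step for a binary RBM, updating the parameters in place.

    W: (num_visible, num_hidden) weights; b, c: visible/hidden biases;
    v0: (batch, num_visible) binary training data.
    """
    # Positive phase: hidden activation probabilities with the data clamped.
    ph0 = sigmoid(v0 @ W + c)

    # Negative phase: k steps of block Gibbs sampling, initialized at the data.
    v = v0
    for _ in range(k):
        h = (rng.random((v.shape[0], W.shape[1])) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(v0.shape) < sigmoid(h @ W.T + b)).astype(float)
    phk = sigmoid(v @ W + c)

    # Gradient approximation: data statistics minus k-step model statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v.T @ phk) / n
    b += lr * (v0 - v).mean(axis=0)
    c += lr * (ph0 - phk).mean(axis=0)

# Toy usage with hypothetical sizes: 6 visible and 4 hidden units.
W = 0.01 * rng.standard_normal((6, 4))
b, c = np.zeros(6), np.zeros(4)
data = (rng.random((20, 6)) < 0.5).astype(float)
for _ in range(100):
    cd_k_update(W, b, c, data, k=1)
```

Parallel tempering addresses the poor mixing of a single Gibbs chain by running several chains at different inverse temperatures and occasionally proposing to swap the states of adjacent chains, accepted with a Metropolis test. A hedged sketch of the swap move for RBM visible states, using the analytically available free energy, might look as follows (rbm_free_energy and swap_adjacent are assumed names; the temperature ladder is an illustrative choice):

```python
def rbm_free_energy(v, W, b, c, beta=1.0):
    """Free energy F(v) of a binary RBM with the energy scaled by beta;
    the hidden units are summed out analytically."""
    return -beta * (v @ b) - np.sum(np.logaddexp(0.0, beta * (v @ W + c)), axis=-1)

def swap_adjacent(chains, betas, W, b, c):
    """Propose Metropolis swaps between all pairs of adjacent tempered chains.

    chains: list of 1-D binary visible states, coolest (beta = 1) first;
    betas:  matching inverse temperatures, e.g. np.linspace(1.0, 0.5, len(chains)).
    """
    F = lambda v, beta: rbm_free_energy(v, W, b, c, beta)
    for i in range(len(chains) - 1):
        # Log of the swap acceptance ratio
        # p_i(v_{i+1}) p_{i+1}(v_i) / (p_i(v_i) p_{i+1}(v_{i+1})).
        log_acc = (F(chains[i], betas[i]) + F(chains[i + 1], betas[i + 1])
                   - F(chains[i + 1], betas[i]) - F(chains[i], betas[i + 1]))
        if np.log(rng.random()) < log_acc:
            chains[i], chains[i + 1] = chains[i + 1], chains[i]
```

In stochastic maximum likelihood training with parallel tempering, Gibbs steps at each temperature alternate with such swap proposals, and only the beta = 1 chain contributes samples to the negative phase of the gradient.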
