A bound for the convergence rate of parallel tempering for sampling restricted Boltzmann machines

Sampling from restricted Boltzmann machines (RBMs) is typically done using Markov chain Monte Carlo (MCMC) methods. The faster the Markov chain converges, the more efficiently high-quality samples can be obtained. Fast convergence is also important for robust training of RBMs, which usually relies on sampling. Parallel tempering (PT), an MCMC method that maintains several replicas of the original chain at higher temperatures, has been applied successfully to RBM training. We present the first analysis of the convergence rate of PT for sampling from binary RBMs. The resulting bound on the convergence rate of the PT Markov chain depends exponentially on the size of one layer and on the absolute values of the RBM parameters. It is minimized by a uniform spacing of the inverse temperatures, which is often used in practice. As in the derivation of bounds on the approximation error of contrastive divergence learning, our bound on the mixing time implies an upper bound on the error of the gradient approximation when the method is used for RBM training.
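To make the sampling scheme concrete, the following is a minimal sketch (not the construction analysed in the paper) of parallel tempering with blocked Gibbs transitions for a binary RBM. It assumes the standard energy E(v,h) = -v^T W h - b^T v - c^T h, tempered distributions proportional to exp(-beta_k E(v,h)), and uniformly spaced inverse temperatures beta_k = k/(K-1) with beta = 1 being the target chain; all variable names and the toy parameter values are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, b, c):
    # E(v, h) = -v^T W h - b^T v - c^T h
    return -v @ W @ h - b @ v - c @ h

def pt_gibbs_step(states, betas, W, b, c, rng):
    """One PT step: a tempered blocked Gibbs update per replica, then adjacent swaps."""
    # Blocked Gibbs update of each replica at its inverse temperature beta_k
    for k, beta in enumerate(betas):
        v, h = states[k]
        h = (rng.random(c.size) < sigmoid(beta * (v @ W + c))).astype(float)
        v = (rng.random(b.size) < sigmoid(beta * (W @ h + b))).astype(float)
        states[k] = (v, h)
    # Metropolis swaps between neighbouring temperatures
    for k in range(len(betas) - 1):
        e_k  = energy(*states[k],     W, b, c)
        e_k1 = energy(*states[k + 1], W, b, c)
        accept = min(1.0, np.exp((betas[k + 1] - betas[k]) * (e_k1 - e_k)))
        if rng.random() < accept:
            states[k], states[k + 1] = states[k + 1], states[k]
    return states

# Toy usage: 4 visible and 3 hidden units, K = 5 uniformly spaced inverse temperatures
rng = np.random.default_rng(0)
m, n, K = 4, 3, 5
W = 0.1 * rng.standard_normal((m, n))
b, c = np.zeros(m), np.zeros(n)
betas = np.linspace(0.0, 1.0, K)          # uniform spacing; beta = 1 is the target chain
states = [(rng.integers(0, 2, m).astype(float),
           rng.integers(0, 2, n).astype(float)) for _ in range(K)]
for _ in range(100):
    states = pt_gibbs_step(states, betas, W, b, c, rng)
sample_v = states[-1][0]                  # visible state of the replica at beta = 1
```

The swap acceptance probability min(1, exp((beta_{k+1} - beta_k)(E_{k+1} - E_k))) is the standard Metropolis ratio for exchanging the states of neighbouring replicas; only the state of the chain at beta = 1 is treated as a sample from the target RBM distribution.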
