Algorithms for estimating the partition function of restricted Boltzmann machines

Abstract: Accurate estimates of the normalization constants (partition functions) of energy-based probabilistic models (Markov random fields) are highly important, for example, for assessing the performance of models, monitoring training progress, and conducting likelihood ratio tests. Several algorithms for estimating the partition function (relative to a reference distribution) have been introduced, including Annealed Importance Sampling (AIS) and Bennett's Acceptance Ratio method (BAR). However, their conceptual similarities and differences have not yet been worked out, and systematic comparisons of their behavior in practice are missing. We devise a unifying theoretical framework for these algorithms, which comprises existing variants and suggests new approaches. It is based on a generalized form of Crooks' equality, which links the expectation over a distribution of samples generated by a transition operator to the expectation over the distribution induced by the reversed operator. The framework covers different ways of generating samples, such as parallel tempering and path sampling. An empirical comparison revealed the differences between the methods when estimating the partition function of restricted Boltzmann machines and Ising models. In our experiments, BAR using parallel tempering worked well with a small number of bridging distributions, while AIS based on path sampling performed best when many bridging distributions were available. Because BAR gave the best results overall, we favor it over AIS. Furthermore, the experiments showed the importance of choosing a proper reference distribution.
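
To make the idea of estimating a partition function relative to a reference distribution concrete, the following is a minimal sketch of AIS for a tiny binary RBM. It is not the implementation used in the paper. It assumes the energy E(v, h) = -b·v - c·h - vᵀWh, a uniform reference distribution (all parameters zero, so Z_0 = 2^(n+m)), linearly spaced inverse temperatures, and one Gibbs sweep per bridging distribution. All function names (`log_unnorm`, `exact_log_z`, `ais_log_z`) and parameter values are hypothetical and chosen for illustration only.

```python
import itertools
import numpy as np


def log_unnorm(v, W, b, c, beta):
    """log p*_beta(v) of the RBM with parameters scaled by beta,
    hidden units summed out; v has shape (n_states, n_visible)."""
    return beta * (v @ b) + np.logaddexp(0.0, beta * (v @ W + c)).sum(axis=1)


def exact_log_z(W, b, c):
    """Brute-force log Z by enumerating all visible states (tiny models only)."""
    states = np.array(list(itertools.product([0.0, 1.0], repeat=len(b))))
    return np.logaddexp.reduce(log_unnorm(states, W, b, c, 1.0))


def ais_log_z(W, b, c, n_betas=1000, n_runs=500, seed=0):
    """AIS estimate of log Z, annealing from a uniform reference (beta = 0)
    to the target RBM (beta = 1). Assumed setup, not the paper's."""
    rng = np.random.default_rng(seed)
    n, m = W.shape
    betas = np.linspace(0.0, 1.0, n_betas)
    # Exact samples from the uniform reference distribution.
    v = (rng.random((n_runs, n)) < 0.5).astype(float)
    log_w = np.zeros(n_runs)
    for beta_prev, beta in zip(betas[:-1], betas[1:]):
        # Importance-weight update, evaluated at the current sample.
        log_w += log_unnorm(v, W, b, c, beta) - log_unnorm(v, W, b, c, beta_prev)
        # One Gibbs sweep that leaves the bridging distribution at beta invariant.
        p_h = 1.0 / (1.0 + np.exp(-beta * (v @ W + c)))
        h = (rng.random((n_runs, m)) < p_h).astype(float)
        p_v = 1.0 / (1.0 + np.exp(-beta * (h @ W.T + b)))
        v = (rng.random((n_runs, n)) < p_v).astype(float)
    log_z0 = (n + m) * np.log(2.0)  # partition function of the uniform reference
    # log Z is approximated by log Z_0 + log(mean of the importance weights).
    return log_z0 + np.logaddexp.reduce(log_w) - np.log(n_runs)


# Small random RBM (6 visible, 4 hidden units) so that exact enumeration is feasible.
param_rng = np.random.default_rng(42)
W = 0.5 * param_rng.standard_normal((6, 4))
b = 0.1 * param_rng.standard_normal(6)
c = 0.1 * param_rng.standard_normal(4)

exact = exact_log_z(W, b, c)
estimate = ais_log_z(W, b, c)
print(f"exact log Z = {exact:.4f}, AIS estimate = {estimate:.4f}")
```

With a model this small, the estimate can be checked against exact enumeration. The uniform reference is used here only because its partition function is known in closed form; a reference closer to the target would typically need fewer bridging distributions.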
