On Contrastive Divergence Learning

Maximum-likelihood (ML) learning of Markov random fields is challenging because it requires estimating averages that have an exponential number of terms. Markov chain Monte Carlo methods typically take a long time to converge on unbiased estimates, but Hinton (2002) showed that if the Markov chain is only run for a few steps, the learning can still work well and it approximately minimizes a different function called "contrastive divergence" (CD). CD learning has been successfully applied to various types of random fields. Here, we study the properties of CD learning and show that it provides biased estimates in general, but that the bias is typically very small. Fast CD learning can therefore be used to get close to an ML solution, and slow ML learning can then be used to fine-tune the CD solution.

Consider a probability distribution over a vector $\mathbf{x}$ (assumed discrete w.l.o.g.) with parameters $W$:

$$p(\mathbf{x}; W) = \frac{1}{Z(W)}\, e^{-E(\mathbf{x}; W)} \qquad (1)$$

where $Z(W) = \sum_{\mathbf{x}} e^{-E(\mathbf{x}; W)}$ is a normalisation constant and $E(\mathbf{x}; W)$ is an energy function. This class of random-field distributions has found many practical applications (Li, 2001; Winkler, 2002; Teh et al., 2003; He et al., 2004). ML learning of the parameters $W$ given an iid sample $\mathcal{X} = \{\mathbf{x}_n\}_{n=1}^{N}$ can be done by gradient ascent:

$$W^{(\tau+1)} = W^{(\tau)} + \eta \left. \frac{\partial \mathcal{L}(W; \mathcal{X})}{\partial W} \right|_{W^{(\tau)}}$$

where $\mathcal{L}(W; \mathcal{X}) = \frac{1}{N} \sum_{n=1}^{N} \log p(\mathbf{x}_n; W)$ is the average log-likelihood and $\eta$ is the learning rate.
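The difficulty is that this gradient involves an expectation under the model distribution, a sum with exponentially many terms; CD replaces it with statistics gathered after only a few Markov-chain steps started at the data. As a concrete illustration, here is a minimal NumPy sketch of one CD-1 update for a binary restricted Boltzmann machine, the setting in which Hinton (2002) introduced CD, with energy $E(\mathbf{v},\mathbf{h};W) = -\mathbf{v}^\top W \mathbf{h} - \mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h}$. The function and variable names (`cd1_step`, `eta`, etc.) are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b, c, eta=0.1):
    """One CD-1 update for a binary RBM with energy E(v,h) = -v'Wh - b'v - c'h.

    v0: (n_samples, n_visible) batch of training vectors.
    """
    # Positive phase: hidden-unit statistics driven by the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct the visibles, then resample the hiddens.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # CD-1 gradient estimate: data statistics minus one-step reconstruction
    # statistics (in place of the intractable model expectation).
    n = v0.shape[0]
    W += eta * (v0.T @ ph0 - v1.T @ ph1) / n
    b += eta * (v0 - v1).mean(axis=0)
    c += eta * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Toy usage: 6 visible units, 4 hidden units, random binary "data".
W = 0.01 * rng.standard_normal((6, 4))
b = np.zeros(6)
c = np.zeros(4)
data = (rng.random((100, 6)) < 0.5).astype(float)
for epoch in range(50):
    W, b, c = cd1_step(data, W, b, c, eta=0.1)
```

The exact ML update would instead require the model expectation under $p(\mathbf{v};W)$, a sum over all $2^{n_v+n_h}$ configurations; CD-1 avoids exactly that cost, at the price of the (typically small) bias analysed in this paper.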

[1] I. Good. The population frequencies of species and the estimation of population parameters, 1953.

[2] Paul Smolensky et al. Information processing in dynamical systems: Foundations of harmony theory, 1986.

[3] Pierre Priouret et al. Adaptive Algorithms and Stochastic Approximations. Applications of Mathematics, 1990.

[4] David Haussler et al. Unsupervised learning of distributions on binary vectors using two layer networks. NIPS, 1991.

[5] Sylvia Richardson et al. Markov Chain Monte Carlo in Practice, 1997.

[6] D. MacKay. Failures of the One-Step Learning Algorithm, 2001.

[7] Stan Z. Li. Markov Random Field Modeling in Image Analysis. Computer Science Workbench, 2001.

[8] Geoffrey E. Hinton et al. Stochastic Neighbor Embedding. NIPS, 2002.

[9] Gerhard Winkler. Image Analysis, Random Fields and Markov Chain Monte Carlo Methods: A Mathematical Introduction, 2002.

[10] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 2002.

[11] Christopher K. I. Williams et al. An analysis of contrastive divergence learning in Gaussian Boltzmann machines, 2002.

[12] Alan F. Murray et al. Continuous restricted Boltzmann machine with an implementable training algorithm, 2003.

[13] Yee Whye Teh et al. Energy-Based Models for Sparse Overcomplete Representations. Journal of Machine Learning Research, 2003.

[14] Miguel Á. Carreira-Perpiñán et al. Multiscale conditional random fields for image labeling. CVPR, 2004.

[15] Alan L. Yuille. The Convergence of Contrastive Divergences. NIPS, 2004.