On the Convergence Properties of Contrastive Divergence

Contrastive Divergence (CD) is a popular method for estimating the parameters of Markov Random Fields (MRFs) by rapidly approximating an intractable term in the gradient of the log probability. Despite CD's empirical success, little is known about its theoretical convergence properties. In this paper, we analyze the CD-1 update rule for Restricted Boltzmann Machines (RBMs) with binary variables. We show that this update is not the gradient of any function, and we construct a counterintuitive "regularization function" that causes CD learning to cycle indefinitely. Nonetheless, using Brouwer's fixed-point theorem, we show that the regularized CD update has a fixed point for a large class of regularization functions.
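
For readers unfamiliar with the update being analyzed, the following is a minimal NumPy sketch of a single CD-1 parameter update for a binary RBM. The variable names (W, b, c), the learning rate lr, and the use of hidden-unit probabilities in the sufficient statistics are illustrative assumptions for exposition, not details taken from this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, rng, lr=0.1):
    """One CD-1 update for a binary RBM (illustrative sketch).

    v0   : (n_visible,) binary training vector
    W    : (n_visible, n_hidden) weight matrix
    b, c : visible and hidden bias vectors
    """
    # Positive phase: hidden probabilities and a sample given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # One Gibbs step: reconstruct the visibles, then recompute hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)

    # CD-1 replaces the intractable model expectation in the log-likelihood
    # gradient with statistics computed from the one-step reconstruction.
    W = W + lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b = b + lr * (v0 - v1)
    c = c + lr * (ph0 - ph1)
    return W, b, c

# Example usage with random data and parameters:
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((6, 4))
b, c = np.zeros(6), np.zeros(4)
v0 = (rng.random(6) < 0.5).astype(float)
W, b, c = cd1_update(v0, W, b, c, rng)
```

As the paper notes, this update is not the gradient of any objective function, which is what motivates the fixed-point analysis of its regularized variant.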
