Justifying and Generalizing Contrastive Divergence

We study an expansion of the log likelihood in undirected graphical models such as the restricted Boltzmann machine (RBM), where each term in the expansion is associated with a sample in a Gibbs chain alternating between two random variables (the visible vector and the hidden vector in RBMs). We are particularly interested in estimators of the gradient of the log likelihood obtained through this expansion. We show that the residual term of the expansion converges to zero, justifying the use of a truncation that runs only a short Gibbs chain, which is the main idea behind the contrastive divergence (CD) estimator of the log-likelihood gradient. By truncating even more, we obtain a stochastic reconstruction error, related through a mean-field approximation to the reconstruction error often used to train autoassociators and stacked autoassociators. The derivation is not specific to the particular parametric forms used in RBMs and requires only convergence of the Gibbs chain. We present theoretical and empirical evidence linking the number of Gibbs steps k and the magnitude of the RBM parameters to the bias in the CD estimator. These experiments also suggest that the sign of the CD estimator is correct most of the time, even when the bias is large, so that CD-k provides a good descent direction even for small k.
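As a concrete illustration of the estimator discussed above, the following is a minimal NumPy sketch of CD-k for a binary-binary RBM: the intractable model expectation in the log-likelihood gradient is replaced by a sample obtained after k steps of alternating Gibbs sampling, which is exactly the truncation being justified. This is an assumed implementation for illustration only; the function name cd_k_gradients and the parameter names W, b, c are not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd_k_gradients(v0, W, b, c, k=1):
        """One CD-k gradient estimate for a binary RBM.

        v0   : (n_visible,) binary training vector
        W    : (n_visible, n_hidden) weight matrix
        b, c : visible and hidden bias vectors
        Returns an approximate gradient of the log likelihood,
        i.e. dW ~ <v h>_data - <v h>_k (outer products).
        """
        # Positive phase: hidden unit probabilities given the data.
        h0_prob = sigmoid(c + v0 @ W)

        # Negative phase: k steps of alternating Gibbs sampling,
        # starting the chain at the training example v0.
        v, h_prob = v0, h0_prob
        for _ in range(k):
            h = (rng.random(h_prob.shape) < h_prob).astype(float)
            v_prob = sigmoid(b + W @ h)
            v = (rng.random(v_prob.shape) < v_prob).astype(float)
            h_prob = sigmoid(c + v @ W)

        # CD-k estimator: data statistics minus chain statistics.
        dW = np.outer(v0, h0_prob) - np.outer(v, h_prob)
        db = v0 - v
        dc = h0_prob - h_prob
        return dW, db, dc

With k=1 this reduces to the standard CD-1 update commonly used to pretrain stacked RBMs; larger k shrinks the bias at the cost of more Gibbs steps, matching the trade-off the paper analyzes.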
