The Convergence of Contrastive Divergences

Alan Yuille
Department of Statistics, University of California at Los Angeles, Los Angeles, CA 90095
yuille@stat.ucla.edu

Abstract

This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.

1 Introduction

Many learning problems can be reduced to statistical inference of parameters, but inference algorithms for this task tend to be very slow. Recently Hinton proposed a new algorithm called contrastive divergence (CD) [1]. Computer simulations show that this algorithm tends to converge, and to converge rapidly, although not always to the correct solution [2]. Theoretical analysis shows that CD can fail but does not give conditions which guarantee convergence [3,4]. This paper relates CD to the stochastic approximation literature [5,6] and hence derives elementary conditions which ensure convergence (with probability 1). We conjecture that far stronger results can be obtained by applying more advanced techniques, such as those described by Younes [7]. We also give necessary and sufficient conditions for the solution of CD to be unbiased.

Section (2) describes CD and shows that it is closely related to a class of stochastic approximation algorithms for which convergence results exist. In Section (3) we state and prove a simple convergence theorem for stochastic approximation algorithms. Section (4) applies the theorem to give sufficient conditions for convergence of CD.

2 Contrastive Divergence and its Relations

The task of statistical inference is to estimate the model parameters ω* which minimize the Kullback-Leibler divergence D(P_0(x)||P(x|ω)) between the empirical distribution P_0(x) and the model distribution P(x|ω).
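As a concrete toy illustration (not a model from the paper or the cited experiments), the stochastic-approximation view of CD-style learning can be sketched for a categorical exponential-family model p(x|ω) ∝ exp(ω_x). The update subtracts model ("negative phase") feature statistics from empirical ("positive phase") statistics, with a Robbins–Monro step-size schedule satisfying the usual conditions Σ_t η_t = ∞, Σ_t η_t² < ∞ that underlie convergence-with-probability-1 results. In this sketch the negative-phase sampler happens to draw exactly from the current model, so the update is an unbiased stochastic gradient of the KL divergence; real CD instead runs only a few MCMC steps started from the data, which is the source of the possible bias the paper analyses. All names and constants here are illustrative choices, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4                                       # number of discrete states
true_p = np.array([0.1, 0.2, 0.3, 0.4])
data = rng.choice(K, size=400, p=true_p)    # empirical sample x ~ P_0

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

# positive phase: empirical feature expectation (one-hot counts)
pos = np.bincount(data, minlength=K) / data.size

w = np.zeros(K)                             # model parameters omega
for t in range(4000):
    eta = 2.0 / (t + 20)                    # Robbins-Monro: sum eta = inf, sum eta^2 < inf
    # negative phase: samples from the current model p(x|w); because these
    # are exact model samples, (pos - neg) is an unbiased estimate of the
    # negative KL gradient, and the iteration converges w.p. 1
    neg_samples = rng.choice(K, size=data.size, p=softmax(w))
    neg = np.bincount(neg_samples, minlength=K) / data.size
    w += eta * (pos - neg)

emp = np.bincount(data, minlength=K) / data.size
print(np.round(softmax(w), 3))              # learned model distribution
print(np.round(emp, 3))                     # empirical distribution it should match
```

At the fixed point the model statistics match the empirical statistics, i.e. softmax(ω) ≈ P_0; replacing the exact sampler with a short Markov chain started at the data recovers CD proper, whose fixed point need not coincide with the maximum-likelihood solution.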

[1] J. ffitch et al. Course notes. SIGSAM Bull., 1975.

[2] J. Fitch et al. Course notes. SIGS, 1975.

[3] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems (Applied Mathematical Sciences 26). Springer-Verlag, Berlin-Heidelberg-New York, 1978.

[4] G. Grimmett et al. Probability and Random Processes, 2002.

[5] H. Kushner. Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: global minimization via Monte Carlo, 1987.

[6] T. K. Leen et al. Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times. NIPS, 1992.

[7] L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, 1999.

[8] J. Odentrantz et al. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Technometrics, 2000.

[9] D. MacKay. Failures of the One-Step Learning Algorithm, 2001.

[10] G. E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 2002.

[11] S.-C. Zhu et al. Learning in Gibbsian Fields: How Accurate and How Fast Can It Be? IEEE Trans. Pattern Anal. Mach. Intell., 2002.

[12] C. K. I. Williams et al. An analysis of contrastive divergence learning in Gaussian Boltzmann machines, 2002.

[13] Y. W. Teh et al. Energy-Based Models for Sparse Overcomplete Representations. J. Mach. Learn. Res., 2003.