Unbiased Contrastive Divergence Algorithm for Training Energy-Based Latent Variable Models

The contrastive divergence algorithm is a popular approach to training energy-based latent variable models and has been widely used in models such as restricted Boltzmann machines and deep belief nets. Despite its empirical success, the contrastive divergence algorithm is known to be biased, and this bias can severely affect its convergence. In this article we propose an unbiased version of the contrastive divergence algorithm that completely removes its bias in stochastic gradient methods, building on recent advances in unbiased Markov chain Monte Carlo methods. A rigorous theoretical analysis is developed to justify the proposed algorithm, and numerical experiments show that it significantly outperforms the existing method. Our findings suggest that the unbiased contrastive divergence algorithm is a promising approach to training general energy-based latent variable models.
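The "recent advances" the abstract refers to are coupling-based unbiased MCMC estimators in the spirit of Glynn and Rhee (2014) and Jacob, O'Leary and Atchadé (2017): two lagged chains with the same transition kernel are run jointly until they meet, and a telescoping correction term removes the burn-in bias of a fixed-lag estimate. The sketch below is a minimal, hypothetical illustration of that estimator, not the paper's implementation; `single_step`, `coupled_step`, and `h` are placeholder callables supplied by the user.

```python
import numpy as np

def unbiased_estimate(single_step, coupled_step, x0, y0, h, k=10, max_iter=10_000):
    """Unbiased estimate of E_pi[h(X)] from a pair of coupled Markov chains.

    single_step(x)     -- one draw from the marginal transition kernel
    coupled_step(x, y) -- one joint draw (x', y') from a coupling of the kernel
                          with itself, built so that x' == y' occurs with
                          positive probability and the chains stay equal once met
    h(x)               -- test function whose stationary expectation is wanted
    k                  -- lag of the estimator; any fixed k >= 1 keeps it unbiased
    """
    x = single_step(x0)        # X_1: the X chain runs one step ahead of Y
    y = y0                     # Y_0
    xs, ys = [x], [y]          # xs[t-1] stores X_t, ys[t] stores Y_t
    tau = None                 # meeting time: first t with X_t == Y_{t-1}
    t = 1
    while tau is None or t < k:
        if t >= max_iter:
            raise RuntimeError("chains did not meet (or reach lag k) in time")
        if tau is None and np.array_equal(x, y):
            tau = t
        if tau is None:
            x, y = coupled_step(x, y)   # jointly advance to (X_{t+1}, Y_t)
            ys.append(y)
        else:
            x = single_step(x)          # chains have merged; only X_k is still needed
        xs.append(x)
        t += 1

    # Estimator: h(X_k) plus a telescoping bias correction that is zero
    # once the chains have met, so the sum has finitely many terms.
    est = h(xs[k - 1])
    for s in range(k + 1, tau):
        est = est + h(xs[s - 1]) - h(ys[s - 1])
    return est
```

In the restricted Boltzmann machine setting mentioned in the abstract, one would presumably take `single_step` to be a block Gibbs sweep over hidden and visible units, `coupled_step` a coupling of two such sweeps under which the chains can coalesce, and `h` the sufficient statistics whose stationary mean appears in the negative phase of the log-likelihood gradient; the resulting estimate then replaces the usual k-step Gibbs estimate inside the stochastic gradient update.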
