Adiabatic Persistent Contrastive Divergence learning

This paper studies parameter learning in graphical models with latent variables, where the standard approach is the expectation-maximization (EM) algorithm alternating expectation (E) and maximization (M) steps. For high-dimensional data, however, both the E and M steps are computationally intractable, and replacing either step with a faster surrogate to cope with this intractability can cause the algorithm to fail to converge. To tackle this issue, the Contrastive Divergence (CD) learning scheme has become popular in the deep learning community: it runs a mean-field approximation in the E step and a few cycles of a Markov chain (MC) in the M step. In this paper, we analyze a variant of CD, called Adiabatic Persistent Contrastive Divergence (APCD), which runs a few cycles of MCs in both the E and M steps. Using multi-time-scale stochastic approximation theory, we prove that APCD converges to a correct optimum, whereas standard CD cannot enjoy such a guarantee because of the mean-field approximation gap in its E step. Despite this stronger theoretical guarantee, a possible practical drawback of APCD is slow mixing in the E step. To address this issue, we also design a hybrid approach that applies both mean-field and MC approximations in the E step, and it outperforms the standard mean-field-based CD in our experiments on real-world datasets.
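The sketch below is a minimal, hypothetical illustration of the update scheme the abstract describes, not the authors' implementation. It assumes a binary restricted Boltzmann machine purely to keep the example self-contained (in an RBM the data-clamped posterior is actually tractable; the clamped E-step chain here stands in for the chain APCD would run in deeper models such as DBMs, where the posterior is intractable). It differs from standard CD in keeping persistent Gibbs chains for both the positive (E-step) and negative (M-step) statistics; all names (apcd_step, sample_h_given_v, etc.) are invented for the sketch.

```python
# A minimal, illustrative sketch of an APCD-style update (not the authors' code).
# Assumption: a binary RBM with weights W, visible bias a, hidden bias b.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

def sample_h_given_v(v, W, b):
    # One Gibbs transition for the hidden units given the visibles.
    return bernoulli(sigmoid(v @ W + b))

def sample_v_given_h(h, W, a):
    # One Gibbs transition for the visible units given the hiddens.
    return bernoulli(sigmoid(h @ W.T + a))

def apcd_step(v_data, W, a, b, h_pos, v_neg, h_neg, lr, k=1):
    """One APCD-style parameter update.

    h_pos          -- persistent chain clamped to the data (E-step role)
    (v_neg, h_neg) -- persistent free-running chain over the model (M-step role)
    The chains are assumed to have the same batch size as v_data.
    """
    # E step: a few Gibbs cycles of the data-clamped chain (standard CD would
    # instead plug in a mean-field estimate of the hidden units here).
    for _ in range(k):
        h_pos = sample_h_given_v(v_data, W, b)

    # M step: a few Gibbs cycles of the free-running chain, as in PCD.
    for _ in range(k):
        h_neg = sample_h_given_v(v_neg, W, b)
        v_neg = sample_v_given_h(h_neg, W, a)

    # Stochastic-approximation gradient: positive minus negative statistics.
    batch = len(v_data)
    W = W + lr * (v_data.T @ h_pos - v_neg.T @ h_neg) / batch
    a = a + lr * (v_data - v_neg).mean(axis=0)
    b = b + lr * (h_pos - h_neg).mean(axis=0)
    return W, a, b, h_pos, v_neg, h_neg
```

In the multi-time-scale (adiabatic) view alluded to above, lr would follow a decreasing schedule (e.g., proportional to 1/t) so that the persistent chains effectively mix on a faster time scale than the parameters move; the hybrid variant mentioned at the end of the abstract would additionally seed or combine the E-step chain with a mean-field estimate of the posterior.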
