Reweighted Wake-Sleep

Training deep directed graphical models with many hidden variables, and performing inference in them, remain major challenges. Helmholtz machines and deep belief networks are two such models, and the wake-sleep algorithm has been proposed to train them. Wake-sleep trains not only the directed generative model but also a conditional inference network that runs backward from visible to latent units, estimating the posterior distribution over latent variables given the visible ones. We propose a novel interpretation of the wake-sleep algorithm which suggests that better gradient estimators can be obtained by sampling the latent variables multiple times from the inference network. This view rests on importance sampling as an estimator of the likelihood, with the approximate inference network serving as the proposal distribution. Experiments confirm this interpretation, showing that better likelihoods can be achieved with this reweighted wake-sleep procedure. The same interpretation also suggests that sigmoid belief network layers are not powerful enough for the inference network to recover a good estimator of the posterior distribution over latent variables; our experiments show that a more powerful layer model, such as NADE, yields substantially better generative models.
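
Concretely, the proposed interpretation treats the marginal likelihood as an importance-sampling average over K samples drawn from the inference network. A sketch in our own notation (θ for generative parameters, φ for inference-network parameters, h for latent and x for visible variables; these symbols are not taken from the abstract):

$$
p_\theta(x) \;=\; \mathbb{E}_{q_\phi(h \mid x)}\!\left[\frac{p_\theta(x, h)}{q_\phi(h \mid x)}\right] \;\approx\; \frac{1}{K}\sum_{k=1}^{K} w_k, \qquad w_k = \frac{p_\theta(x, h^{(k)})}{q_\phi(h^{(k)} \mid x)}, \quad h^{(k)} \sim q_\phi(h \mid x).
$$

The corresponding gradient estimator reweights each sample's gradient by its self-normalized importance weight, which is where drawing the latent variables multiple times pays off:

$$
\nabla_\theta \log p_\theta(x) \;\approx\; \sum_{k=1}^{K} \tilde{w}_k \,\nabla_\theta \log p_\theta(x, h^{(k)}), \qquad \tilde{w}_k = \frac{w_k}{\sum_{j=1}^{K} w_j}.
$$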
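
For concreteness, a minimal NumPy sketch of these two quantities; the function names, array shapes, and toy log-probabilities are our own illustration, not code from the paper:

```python
import numpy as np

def log_marginal_estimate(log_p_joint, log_q):
    """Importance-sampling estimate of log p(x) from K proposal samples.

    log_p_joint: shape-(K,) array of log p(x, h^(k))
    log_q:       shape-(K,) array of log q(h^(k) | x)
    """
    log_w = log_p_joint - log_q          # log importance weights
    m = log_w.max()                      # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_w - m)))

def normalized_weights(log_p_joint, log_q):
    """Self-normalized weights used to reweight per-sample gradients."""
    log_w = log_p_joint - log_q
    w = np.exp(log_w - log_w.max())      # stabilize before exponentiating
    return w / w.sum()

# Toy usage with made-up log-probabilities for K = 5 samples:
log_p_joint = np.array([-10.2, -9.8, -11.0, -10.5, -9.9])
log_q = np.array([-3.1, -2.9, -3.4, -3.0, -2.8])
print(log_marginal_estimate(log_p_joint, log_q))
print(normalized_weights(log_p_joint, log_q))  # sums to 1
```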
