Deep Unsupervised Learning using Nonequilibrium Thermodynamics

A central problem in machine learning involves modeling complex datasets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation remain analytically or computationally tractable. Here, we develop an approach that achieves both flexibility and tractability. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in the data, yielding a highly flexible and tractable generative model. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model. We additionally release an open-source reference implementation of the algorithm.
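To make the forward process concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's released reference implementation: the function name forward_diffusion, the Gaussian perturbation kernel, and the linear variance schedule are choices of this sketch. At each step a small fraction of the signal is replaced by Gaussian noise, so that after many steps the sample is indistinguishable from an isotropic Gaussian.

    import numpy as np

    def forward_diffusion(x0, betas, rng=None):
        # Sketch of the forward (structure-destroying) diffusion:
        # at each step t the sample is perturbed as
        #   x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise,
        # so after many small steps x_T approaches an isotropic Gaussian.
        # (Illustrative only; not the paper's released implementation.)
        if rng is None:
            rng = np.random.default_rng()
        x = np.asarray(x0, dtype=float)
        trajectory = [x]
        for beta in betas:
            noise = rng.standard_normal(x.shape)
            x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
            trajectory.append(x)
        return trajectory

    # Example: diffuse a toy 2-D point over 1000 steps with a linear
    # variance schedule (an assumed schedule, chosen for illustration).
    betas = np.linspace(1e-4, 0.02, 1000)
    traj = forward_diffusion([2.0, -1.5], betas)

The learned generative model is the reverse of this chain: a sequence of small Gaussian steps whose means and covariances are fit so that running them from pure noise restores the data distribution.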
