Oops I Took A Gradient: Scalable Sampling for Discrete Distributions

We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings, including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models (EBMs) on high-dimensional discrete data. This approach outperforms variational autoencoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
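To make the idea concrete, below is a minimal sketch of a gradient-informed Metropolis-Hastings step for binary variables, written in PyTorch. It is not the authors' implementation: the function names (gwg_step, log_prob), the softmax-over-flips proposal with a temperature of 2, and the toy Ising-style model are all illustrative assumptions chosen to match the abstract's description of using gradients with respect to the discrete inputs to propose local updates.

import torch

def gwg_step(x, log_prob):
    """One Metropolis-Hastings step that uses gradients of log_prob with
    respect to the (relaxed) discrete input to choose which bit to flip.

    x: (batch, D) tensor of {0, 1} values.
    log_prob: callable mapping a (batch, D) float tensor to (batch,) log-probabilities.
    """
    x = x.float()
    x.requires_grad_(True)
    logp_x = log_prob(x)
    grad = torch.autograd.grad(logp_x.sum(), x)[0]

    # First-order estimate of the change in log-prob from flipping each bit.
    delta = (1.0 - 2.0 * x) * grad                      # (batch, D)
    forward_dist = torch.distributions.Categorical(logits=delta / 2.0)
    idx = forward_dist.sample()                         # index of the bit to flip

    # Propose the single-bit flip.
    flip = torch.nn.functional.one_hot(idx, x.shape[1]).float()
    x_new = (x + flip) % 2.0

    # Reverse-proposal probability evaluated at the proposed state.
    x_new = x_new.detach().requires_grad_(True)
    logp_new = log_prob(x_new)
    grad_new = torch.autograd.grad(logp_new.sum(), x_new)[0]
    delta_new = (1.0 - 2.0 * x_new) * grad_new
    reverse_dist = torch.distributions.Categorical(logits=delta_new / 2.0)

    # Metropolis-Hastings acceptance ratio.
    log_alpha = (logp_new - logp_x
                 + reverse_dist.log_prob(idx)
                 - forward_dist.log_prob(idx))
    accept = (torch.rand_like(log_alpha).log() < log_alpha).float().unsqueeze(-1)
    return (accept * x_new + (1.0 - accept) * x).detach()

if __name__ == "__main__":
    # Toy target: a small Ising-like model with random symmetric couplings (assumed).
    D = 16
    J = torch.randn(D, D) * 0.1
    J = (J + J.t()) / 2

    def log_prob(x):
        s = 2.0 * x - 1.0               # map {0, 1} to {-1, +1} spins
        return torch.einsum('bi,ij,bj->b', s, J, s)

    x = torch.randint(0, 2, (8, D)).float()
    for _ in range(100):
        x = gwg_step(x, log_prob)

The key design choice the sketch illustrates is that the proposal concentrates on flips the gradient predicts will increase the log-probability, while the standard Metropolis-Hastings correction keeps the chain exact for the target distribution.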
