Entropy-based Training Methods for Scalable Neural Implicit Sampler

Efficiently sampling from unnormalized target distributions is a fundamental problem in scientific computing and machine learning. Traditional approaches such as Markov Chain Monte Carlo (MCMC) guarantee asymptotically unbiased samples from such distributions but suffer from computational inefficiency, particularly for high-dimensional targets, because they require many iterations to generate a batch of samples. In this paper, we propose an efficient and scalable neural implicit sampler that overcomes these limitations. Our sampler generates large batches of samples at low computational cost by leveraging a neural transformation that directly maps easily sampled latent vectors to target samples, without any iterative procedure. To train the neural implicit sampler, we introduce two novel methods: the KL training method and the Fisher training method. The former minimizes the Kullback-Leibler divergence, while the latter minimizes the Fisher divergence. By employing these training methods, we effectively optimize the neural implicit sampler to capture the desired target distribution. To demonstrate the effectiveness, efficiency, and scalability of our proposed samplers, we evaluate them on three sampling benchmarks of different scales: sampling from 2D targets, Bayesian inference, and sampling from high-dimensional energy-based models (EBMs). Notably, in the experiment with high-dimensional EBMs, our sampler produces samples comparable to those generated by MCMC-based methods while being more than 100 times more efficient. We believe that the theoretical and empirical contributions presented in this work will stimulate further research on developing efficient samplers for applications beyond those explored in this study.
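
To make the KL training idea concrete, below is a minimal PyTorch sketch of one possible training step, not the authors' released code. It assumes a differentiable unnormalized log-density `log_target` and uses an auxiliary score network (fit by sliced score matching) to estimate the gradient of the sampler's entropy term; the names `ImplicitSampler` and `kl_training_step` are illustrative.

```python
# Minimal sketch (PyTorch) of KL-style training for an implicit sampler.
# Assumptions: log_target(x) returns the unnormalized log-density of the target,
# and score_net is any nn.Module mapping x to an estimate of the sampler's score
# (same shape as x). This is one common estimator, not necessarily the paper's exact procedure.
import torch
import torch.nn as nn

class ImplicitSampler(nn.Module):
    """Maps latent z ~ N(0, I) directly to a sample x = g_theta(z)."""
    def __init__(self, latent_dim, data_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, data_dim),
        )

    def forward(self, z):
        return self.net(z)

def kl_training_step(sampler, score_net, log_target, opt_sampler, opt_score,
                     latent_dim, batch_size=256):
    # 1) Fit the score network to the sampler's current output distribution
    #    with a single sliced score matching step (shown for brevity).
    z = torch.randn(batch_size, latent_dim)
    x = sampler(z).detach().requires_grad_(True)
    s = score_net(x)
    v = torch.randn_like(x)
    grad_sv = torch.autograd.grad((s * v).sum(), x, create_graph=True)[0]
    ssm_loss = ((v * grad_sv).sum(-1) + 0.5 * (s * v).sum(-1) ** 2).mean()
    opt_score.zero_grad()
    ssm_loss.backward()
    opt_score.step()

    # 2) Update the sampler. With x = g_theta(z), the gradient of KL(q || p)
    #    equals E_z[(score_q(x) - score_p(x))^T dx/dtheta]; the surrogate below
    #    has this gradient when score_net approximates the sampler's score.
    z = torch.randn(batch_size, latent_dim)
    x = sampler(z)
    with torch.no_grad():
        sampler_score = score_net(x)          # treated as a constant (stop-gradient)
    surrogate = ((sampler_score * x).sum(-1) - log_target(x)).mean()
    opt_sampler.zero_grad()
    surrogate.backward()
    opt_sampler.step()
```

The Fisher training method would analogously match the sampler's score to the target's score (minimizing a squared score discrepancy) rather than the KL divergence; the sketch above illustrates only the KL case.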
