Parallel and Flexible Sampling from Autoregressive Models via Langevin Dynamics

This paper introduces an alternative approach to sampling from autoregressive models. Autoregressive models are typically sampled sequentially, according to the transition dynamics defined by the model. Instead, we propose a sampling procedure that initializes a sequence with white noise and follows a Markov chain defined by Langevin dynamics on the global log-likelihood of the sequence. This approach parallelizes the sampling process and generalizes to conditional sampling. Using an autoregressive model as a Bayesian prior, we can steer the output of a generative model using a conditional likelihood or constraints. We apply these techniques to autoregressive models in the visual and audio domains, with competitive results for audio source separation, super-resolution, and inpainting.
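
The procedure described above lends itself to a compact sketch. Assuming the model exposes a differentiable sequence-level density (as with noise-smoothed continuous autoregressive models), each unadjusted Langevin step adds a gradient-ascent drift on the global log-likelihood plus Gaussian noise. The sketch below is illustrative only: `log_prob`, `x_init`, the step size, and the step count are assumed placeholders, not the paper's actual schedule, and in practice such dynamics are typically annealed over decreasing noise levels.

```python
import torch

def langevin_sample(log_prob, x_init, n_steps=1000, step_size=1e-4):
    """Unadjusted Langevin dynamics on a sequence-level log-likelihood.

    log_prob : callable returning per-example log-likelihoods for a batch
               of sequences; must be differentiable in its input.
    x_init   : starting point for the chain, e.g. white noise.
    """
    x = x_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        # Gradient of the global log-likelihood w.r.t. the whole sequence;
        # every position is updated simultaneously, hence the parallelism.
        grad = torch.autograd.grad(log_prob(x).sum(), x)[0]
        with torch.no_grad():
            x += 0.5 * step_size * grad                    # drift term
            x += (step_size ** 0.5) * torch.randn_like(x)  # diffusion term
    return x.detach()
```

For conditional tasks such as source separation, super-resolution, or inpainting, the same loop applies unchanged: pass a `log_prob` computing the unnormalized posterior log p(x) + log p(y | x), so that the autoregressive model plays the role of the Bayesian prior described above.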
