Non Gaussian Denoising Diffusion Models

Generative diffusion processes are an emerging and effective tool for image and speech generation. In existing methods, the underlying noise distribution of the diffusion process is Gaussian. However, fitting distributions with more degrees of freedom could improve the performance of such generative models. In this work, we investigate other types of noise distribution for the diffusion process. Specifically, we show that noise drawn from a Gamma distribution provides improved results for image and speech generation. Moreover, we show that using a mixture of Gaussian noise variables in the diffusion process improves performance over a diffusion process based on a single distribution. Our approach preserves the ability to efficiently sample a state at an arbitrary timestep of the training diffusion process while using Gamma noise or a mixture of noises.
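Since the abstract emphasizes that a diffusion state can still be sampled in closed form at an arbitrary timestep under non-Gaussian noise, the sketch below illustrates the general idea: it samples x_t directly from x_0 with standard Gaussian noise, and with centered Gamma noise whose mean and variance are matched to the same noise schedule. The linear beta schedule, the scale parameter theta0, and the helper names (q_sample_gaussian, q_sample_gamma) are assumptions made for illustration, not the paper's exact parameterization.

```python
import numpy as np

# Illustrative sketch (not the paper's exact scheme): closed-form sampling of
# a noisy state x_t directly from x_0, first with Gaussian noise, then with
# zero-mean (centered) Gamma noise whose variance matches the same schedule.

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear beta schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # cumulative product \bar{alpha}_t

def q_sample_gaussian(x0, t, rng):
    """Standard diffusion step: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def q_sample_gamma(x0, t, rng, theta0=0.001):
    """Gamma-noise variant: replace eps with centered Gamma noise of matched variance.

    Draw g ~ Gamma(shape=k_t, scale=theta_t) and use (g - k_t*theta_t), which has
    mean 0 and variance k_t*theta_t**2.  Choosing k_t = (1 - a_bar_t)/theta_t**2
    matches the Gaussian schedule's variance.  The choice theta_t = theta0*sqrt(a_bar_t)
    is one plausible option, assumed here for the example.
    """
    theta_t = theta0 * np.sqrt(alpha_bar[t])
    k_t = (1.0 - alpha_bar[t]) / theta_t**2
    g = rng.gamma(shape=k_t, scale=theta_t, size=x0.shape)
    noise = g - k_t * theta_t           # centered: zero mean, variance 1 - a_bar_t
    return np.sqrt(alpha_bar[t]) * x0 + noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))   # dummy "image"
xt_gauss = q_sample_gaussian(x0, t=500, rng=rng)
xt_gamma = q_sample_gamma(x0, t=500, rng=rng)
```

Because the centered Gamma noise is constructed to share the first two moments of the Gaussian noise at every timestep, training can still draw (x_0, t, x_t) triples in a single step rather than simulating the chain, which is the efficiency property the abstract refers to.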
