Robust model training and generalisation with Studentising flows

Normalising flows are tractable probabilistic models that leverage the power of deep learning to describe a wide parametric family of distributions, all while remaining trainable using maximum likelihood. We discuss how these methods can be further improved based on insights from robust (in particular, resistant) statistics. Specifically, we propose to endow flow-based models with fat-tailed latent distributions such as the multivariate Student's $t$, as a simple drop-in replacement for the Gaussian distribution used by conventional normalising flows. While robustness brings many advantages, this paper explores two of them: 1) We describe how using fatter-tailed base distributions can give benefits similar to gradient clipping, but without compromising the asymptotic consistency of the method. 2) We also discuss how robust ideas lead to models with a reduced generalisation gap and improved held-out data likelihood. Experiments on several different datasets confirm the efficacy of the proposed approach in both respects.
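
To see intuitively why a fat-tailed base distribution behaves like built-in gradient clipping, as point 1) claims, consider the univariate case (an illustrative calculation, not reproduced from the paper). For a standard Gaussian base, the score $-\frac{\mathrm{d}}{\mathrm{d}z}\log p_{\mathcal{N}}(z) = z$ grows without bound, so a single extreme latent value can produce an arbitrarily large gradient. For a Student's $t$ base with $\nu$ degrees of freedom,
\[
-\frac{\mathrm{d}}{\mathrm{d}z}\log p_{t_\nu}(z) \;=\; \frac{(\nu+1)\,z}{\nu + z^{2}} \;\le\; \frac{\nu+1}{2\sqrt{\nu}} \quad \text{for all } z,
\]
so the contribution of any one sample to the gradient is bounded, much as with clipping, except that the bound is part of the likelihood itself rather than an ad hoc modification of the optimiser.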

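The following is a minimal PyTorch sketch of the proposed drop-in replacement. It is illustrative only: the toy affine flow, the helper names (`AffineFlow`, `make_base`, `nll`), and the use of a product of independent Student's $t$ marginals in place of the full multivariate $t$ are assumptions of this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Independent, Normal, StudentT


class AffineFlow(nn.Module):
    """Toy invertible map x -> z; a stand-in for a real flow stack (e.g. Glow)."""

    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # z = (x - shift) * exp(-log_scale); log|det dz/dx| = -sum(log_scale)
        z = (x - self.shift) * torch.exp(-self.log_scale)
        log_det = -self.log_scale.sum().expand(x.shape[0])
        return z, log_det


def make_base(dim, df=None):
    """Gaussian base when df is None; fat-tailed Student's t base otherwise."""
    if df is None:
        return Independent(Normal(torch.zeros(dim), torch.ones(dim)), 1)
    return Independent(StudentT(torch.full((dim,), float(df))), 1)


def nll(flow, base, x):
    # Change of variables: log p(x) = log p_base(z) + log|det dz/dx|
    z, log_det = flow(x)
    return -(base.log_prob(z) + log_det).mean()


# Usage: only the base distribution changes; the flow and training loop stay the same.
dim = 4
flow = AffineFlow(dim)
base = make_base(dim, df=3.0)        # df=None recovers the conventional Gaussian base
optimiser = torch.optim.Adam(flow.parameters(), lr=1e-3)

x = torch.randn(256, dim)            # stand-in for a batch of real training data
optimiser.zero_grad()
loss = nll(flow, base, x)
loss.backward()
optimiser.step()
```

Swapping `make_base(dim)` for `make_base(dim, df=...)` is the only change required; the flow architecture and the maximum-likelihood training loop are untouched, which is what makes the fat-tailed base a drop-in replacement.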