Closed-form expressions for maximum mean discrepancy with applications to Wasserstein auto-encoders

The Maximum Mean Discrepancy (MMD) has found numerous applications in statistics and machine learning, most recently as a penalty in the Wasserstein Auto-Encoder (WAE). In this paper we derive closed-form expressions for estimating the Gaussian-kernel MMD between a given distribution and the standard multivariate normal distribution. We introduce a standardized version of the MMD as a penalty in the WAE training objective, allowing for better interpretability of MMD values and better comparability across hyperparameter settings. Next, we propose using a version of batch normalization at the code layer; this makes kernel-width selection easier, reduces the training effort, and prevents outliers in the aggregate code distribution. Finally, we discuss the appropriate null distributions and provide thresholds for multivariate normality testing with the standardized MMD, leading to a number of easy rules of thumb for monitoring the progress of WAE training. Curiously, our MMD formula reveals a connection to the Baringhaus-Henze-Epps-Pulley (BHEP) statistic of the Henze-Zirkler test and provides further insight into the MMD. Our experiments on synthetic and real data show that the analytic formulation improves over the commonly used stochastic approximation of the MMD, and demonstrate that code normalization provides significant benefits when training WAEs.
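To make the central claim concrete, the following is a minimal NumPy sketch of the kind of closed-form estimator the abstract describes: for the Gaussian kernel k(x, y) = exp(-||x - y||² / (2σ²)) and target N(0, I_d), the cross term E_{Y~N(0,I)}[k(x, Y)] and the target term E[k(Y, Y')] have known analytic forms, so only the sample-sample term needs pairwise evaluation. The function name and the choice of a biased V-statistic are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mmd2_to_standard_normal(X, sigma=1.0):
    """Biased (V-statistic) estimate of the squared Gaussian-kernel MMD
    between the sample X (shape n x d) and N(0, I_d).

    Uses the standard closed forms (assumed, not taken from the paper):
      E_{Y~N(0,I)} k(x, Y) = (s2/(s2+1))^(d/2) * exp(-||x||^2 / (2(s2+1)))
      E k(Y, Y')            = (s2/(s2+2))^(d/2)
    where s2 = sigma**2.
    """
    n, d = X.shape
    s2 = sigma ** 2
    # Sample-sample term: (1/n^2) * sum_ij k(x_i, x_j)
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    term_xx = np.mean(np.exp(-dist2 / (2.0 * s2)))
    # Cross term: analytic expectation over the standard normal target
    term_xy = (s2 / (s2 + 1.0)) ** (d / 2.0) * np.mean(
        np.exp(-sq / (2.0 * (s2 + 1.0))))
    # Target-target term: fully analytic, no sampling needed
    term_yy = (s2 / (s2 + 2.0)) ** (d / 2.0)
    return term_xx - 2.0 * term_xy + term_yy
```

Because two of the three terms are exact, only the O(n²) sample-sample sum contributes Monte Carlo noise, which is the advantage the abstract claims over the fully stochastic two-sample approximation.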
