A Mean-Field Theory for Kernel Alignment with Random Features in Generative Adversarial Networks

We propose a novel supervised learning method to optimize the kernel in maximum mean discrepancy generative adversarial networks (MMD GANs). Specifically, we formulate a distributionally robust optimization problem that computes a good sampling distribution for the random feature model of Rahimi and Recht, which in turn yields a good approximation of the kernel function. Because this distributional optimization problem is infinite-dimensional, we consider a Monte Carlo sample average approximation (SAA) to obtain a more tractable finite-dimensional problem, which we then solve with a particle stochastic gradient descent (SGD) method. Based on a mean-field analysis, we prove that the empirical distribution of the interacting particle system at each SGD iteration follows the path of the gradient descent flow on the Wasserstein manifold. We also establish the non-asymptotic consistency of the finite-sample estimator. Our empirical evaluation on a synthetic data set as well as the MNIST and CIFAR-10 benchmark data sets indicates that our proposed MMD GAN model with kernel learning attains higher inception scores and lower Fréchet inception distances, and generates better images, than the generative moment matching network (GMMN) and MMD GANs with untrained kernels.
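As a concrete illustration (not the authors' implementation), the following Python sketch shows the random feature machinery the abstract builds on: the random Fourier features of Rahimi and Recht approximate a Gaussian kernel, and the feature means of two samples give a plug-in estimate of the squared MMD that an MMD GAN discriminates with. The feature dimension D, bandwidth sigma, and sample sizes are hypothetical choices made for the example.

```python
# Minimal sketch, assuming a Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
import numpy as np

def random_fourier_features(X, W, b):
    # phi(x) = sqrt(2/D) * cos(W x + b), so that E[phi(x)^T phi(y)] = k(x, y)
    # when the rows of W are drawn from the kernel's spectral distribution.
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, D, sigma = 2, 500, 1.0  # input dim, number of features, bandwidth (hypothetical)

# For the Gaussian kernel, the spectral distribution is N(0, sigma^{-2} I).
# In the proposed method this sampling distribution is not fixed: the rows of W
# are the "particles" that the particle SGD would update to learn the kernel.
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

X = rng.normal(size=(200, d))           # stand-in for the "real" sample
Y = rng.normal(loc=0.5, size=(200, d))  # stand-in for the "generated" sample

phi_X = random_fourier_features(X, W, b)
phi_Y = random_fourier_features(Y, W, b)

# Plug-in estimate of MMD^2 in feature space: || mean phi(X) - mean phi(Y) ||^2.
mmd2 = np.sum((phi_X.mean(axis=0) - phi_Y.mean(axis=0)) ** 2)
print(f"estimated MMD^2: {mmd2:.4f}")
```

In the paper's setting, the empirical distribution of the D frequency vectors (the rows of W) plays the role of the particle system whose SGD dynamics are shown to track a Wasserstein gradient flow; here the distribution is simply fixed for illustration.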

[1] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[2] Yu. V. Prokhorov. Convergence of Random Processes and Limit Theorems in Probability Theory, 1956.

[3] A. Jakubowski, et al. On the Skorokhod topology, 1986.

[4] Benjamin Recht, et al. Random Features for Large-Scale Kernel Machines, 2007, NIPS.

[5] John C. Duchi, et al. Learning Kernels with Random Features, 2016, NIPS.

[6] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.

[7] Wei Wang, et al. Improving MMD-GAN Training with Repulsive Loss Function, 2018, ICLR.

[8] Ravi Mazumdar, et al. The Mean-field Behavior of Processor Sharing Systems with General Job Lengths Under the SQ(d) Policy, 2018, Perform. Evaluation.

[9] Andrea Montanari, et al. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit, 2019, COLT.

[10] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[11] Nello Cristianini, et al. Learning the Kernel Matrix with Semidefinite Programming, 2002, J. Mach. Learn. Res.

[12] R. Bass, et al. Review: P. Billingsley, Convergence of probability measures, 1971.

[13] Stephen P. Boyd, et al. Proximal Algorithms, 2013, Found. Trends Optim.

[14] Mehryar Mohri, et al. Algorithms for Learning Kernels Based on Centered Alignment, 2012, J. Mach. Learn. Res.

[15] Yiming Yang, et al. Implicit Kernel Learning, 2019, AISTATS.

[16] Jonathan C. Mattingly, et al. Scaling limits of a model for selection at two scales, 2015, Nonlinearity.

[17] Yoshua Bengio, et al. Mode Regularized Generative Adversarial Networks, 2016, ICLR.

[18] Bernhard Schölkopf, et al. A Kernel Method for the Two-Sample-Problem, 2006, NIPS.

[19] Guangliang Chen, et al. Simple, fast and accurate hyper-parameter tuning in Gaussian-kernel SVM, 2017, International Joint Conference on Neural Networks (IJCNN).

[20] Sepp Hochreiter, et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, 2017, NIPS.

[21] Konstantinos Spiliopoulos, et al. Mean Field Analysis of Neural Networks: A Law of Large Numbers, 2018, SIAM J. Appl. Math.

[22] Michael I. Jordan, et al. A Swiss Army Infinitesimal Jackknife, 2018, AISTATS.

[23] Richard S. Zemel, et al. Generative Moment Matching Networks, 2015, ICML.

[24] Na Li, et al. Stochastic Primal-Dual Method on Riemannian Manifolds of Bounded Sectional Curvature, 2017, 16th IEEE International Conference on Machine Learning and Applications (ICMLA).

[25] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[26] Philippe Robert. Stochastic Networks and Queues, 2003.

[27] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[28] Yoram Baram, et al. Learning by Kernel Polarization, 2005, Neural Computation.

[29] Colin McDiarmid, et al. Surveys in Combinatorics, 1989: On the method of bounded differences, 1989.

[30] Yiming Yang, et al. MMD GAN: Towards Deeper Understanding of Moment Matching Network, 2017, NIPS.

[31] J. Webb. Extensions of Gronwall's inequality with quadratic growth terms and applications, 2018.

[32] F. Otto. The Geometry of Dissipative Evolution Equations: The Porous Medium Equation, 2001.

[33] Sebastian Nowozin, et al. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization, 2016, NIPS.

[34] Adel Javanmard, et al. Analysis of a Two-Layer Neural Network via Displacement Convexity, 2019, The Annals of Statistics.

[35] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[36] Lei Xing, et al. Multiple Kernel Learning from U-Statistics of Empirical Measures in the Feature Space, 2019, arXiv.

[37] Grant M. Rotskoff, et al. Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error, 2018, arXiv.

[38] Wojciech Zaremba, et al. Improved Techniques for Training GANs, 2016, NIPS.

[39] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Sivaraman Balakrishnan, et al. Optimal kernel choice for large-scale two-sample tests, 2012, NIPS.

[41] Vahab S. Mirrokni, et al. Approximate Leave-One-Out for Fast Parameter Tuning in High Dimensions, 2018, ICML.

[42] Ali Rahimi, et al. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, 2008, NIPS.

[43] D. Kinderlehrer, et al. The Variational Formulation of the Fokker-Planck Equation, 1996.

[44] Yue M. Lu, et al. Scaling Limit: Exact and Tractable Analysis of Online Learning Algorithms with Applications to Regularized Regression and PCA, 2017, arXiv.

[45] K. Spiliopoulos. Default clustering in large portfolios: Typical events, 2011, arXiv:1104.1773.

[46] Andrea Montanari, et al. A mean field view of the landscape of two-layer neural networks, 2018, Proceedings of the National Academy of Sciences.

[47] V. S. Varadarajan. On a theorem of F. Riesz concerning the form of linear functionals, 1958.

[48] Gábor Lugosi, et al. Concentration Inequalities: A Nonasymptotic Theory of Independence, 2013.

[49] Geoffrey E. Hinton, et al. Deep Boltzmann Machines, 2009, AISTATS.

[50] Hong Hu, et al. A Solvable High-Dimensional Model of GAN, 2018, NeurIPS.

[51] N. Cristianini, et al. On Kernel-Target Alignment, 2001, NIPS.

[52] Vahid Tarokh, et al. On Data-Dependent Random Features for Improved Generalization in Supervised Learning, 2017, AAAI.

[53] A. Kleywegt, et al. Distributionally Robust Stochastic Optimization with Wasserstein Distance, 2016, Math. Oper. Res.

[54] Barnabás Póczos, et al. Minimax Distribution Estimation in Wasserstein Distance, 2018, arXiv.