Classification Accuracy Score for Conditional Generative Models

Deep generative models (DGMs) of images are now sufficiently mature that they produce nearly photorealistic samples and obtain scores similar to the data distribution on heuristics such as Frechet Inception Distance (FID). These results, especially on large-scale datasets such as ImageNet, suggest that DGMs are learning the data distribution in a perceptually meaningful space and can be used in downstream tasks. To test this latter hypothesis, we use class-conditional generative models from a number of model classes---variational autoencoders, autoregressive models, and generative adversarial networks (GANs)---to infer the class labels of real data. We perform this inference by training an image classifier using only synthetic data and using the classifier to predict labels on real data. The performance on this task, which we call Classification Accuracy Score (CAS), reveals some surprising results not identified by traditional metrics and constitute our contributions. First, when using a state-of-the-art GAN (BigGAN-deep), Top-1 and Top-5 accuracy decrease by 27.9\% and 41.6\%, respectively, compared to the original data; and conditional generative models from other model classes, such as Vector-Quantized Variational Autoencoder-2 (VQ-VAE-2) and Hierarchical Autoregressive Models (HAMs), substantially outperform GANs on this benchmark. Second, CAS automatically surfaces particular classes for which generative models failed to capture the data distribution, and were previously unknown in the literature. Third, we find traditional GAN metrics such as Inception Score (IS) and FID neither predictive of CAS nor useful when evaluating non-GAN models. Furthermore, in order to facilitate better diagnoses of generative models, we open-source the proposed metric.

[1]  Paul Smolensky,et al.  Information processing in dynamical systems: foundations of harmony theory , 1986 .

[2]  Stanley F. Chen,et al.  Evaluation Metrics For Language Models , 1998 .

[3]  D. Stojanoff,et al.  Convergence of iterated Aluthge transform sequence for diagonalizable matrices II: $\lambda$-Aluthge transform , 2007, 0706.1234.

[4]  A. Raftery,et al.  Strictly Proper Scoring Rules, Prediction, and Estimation , 2007 .

[5]  Bernhard Schölkopf,et al.  A Kernel Two-Sample Test , 2012, J. Mach. Learn. Res..

[6]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[7]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[8]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[9]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[10]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Matthias Bethge,et al.  A note on the evaluation of generative models , 2015, ICLR.

[12]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[13]  Samy Bengio,et al.  Density estimation using Real NVP , 2016, ICLR.

[14]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[15]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[16]  Eric P. Xing,et al.  Toward Controlled Generation of Text , 2017, ICML.

[17]  Dhruv Batra,et al.  LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation , 2016, ICLR.

[18]  Gunnar Rätsch,et al.  Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs , 2017, ArXiv.

[19]  Yi Zhang,et al.  Do GANs actually learn the distribution? An empirical study , 2017, ArXiv.

[20]  Aleksander Madry,et al.  A Classification-Based Study of Covariate Shift in GAN Distributions , 2017, ICML.

[21]  Olivier Bachem,et al.  Assessing Generative Models via Precision and Recall , 2018, NeurIPS.

[22]  Stanislau Semeniuta,et al.  On Accurate Evaluation of GANs for Language Generation , 2018, ArXiv.

[23]  Prafulla Dhariwal,et al.  Glow: Generative Flow with Invertible 1x1 Convolutions , 2018, NeurIPS.

[24]  Valentin Khrulkov,et al.  Geometry Score: A Method For Comparing Generative Adversarial Networks , 2018, ICML.

[25]  Nicoló Fusi,et al.  Model Compression with Generative Adversarial Networks , 2018, ArXiv.

[26]  Jaakko Lehtinen,et al.  Progressive Growing of GANs for Improved Quality, Stability, and Variation , 2017, ICLR.

[27]  Rémi Munos,et al.  Autoregressive Quantile Networks for Generative Modeling , 2018, ICML.

[28]  Cordelia Schmid,et al.  How good is my GAN? , 2018, ECCV.

[29]  Arthur Gretton,et al.  Demystifying MMD GANs , 2018, ICLR.

[30]  Takeru Miyato,et al.  cGANs with Projection Discriminator , 2018, ICLR.

[31]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Rishi Sharma,et al.  A Note on the Inception Score , 2018, ArXiv.

[33]  Ali Razavi,et al.  Generating Diverse High-Resolution Images with VQ-VAE , 2019, DGS@ICLR.

[34]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[35]  Lester W. Mackey,et al.  Teacher-Student Compression with Generative Adversarial Networks , 2018, 1812.02271.

[36]  Yee Whye Teh,et al.  Do Deep Generative Models Know What They Don't Know? , 2018, ICLR.

[37]  Ali Razavi,et al.  Generating Diverse High-Fidelity Images with VQ-VAE-2 , 2019, NeurIPS.

[38]  Michael S. Bernstein,et al.  HYPE: Human eYe Perceptual Evaluation of Generative Models , 2019, DGS@ICLR.

[39]  Nal Kalchbrenner,et al.  Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling , 2018, ICLR.

[40]  David Filliat,et al.  Training Discriminative Models to Evaluate Generative Ones , 2019, ICANN.

[41]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Jaakko Lehtinen,et al.  Improved Precision and Recall Metric for Assessing Generative Models , 2019, NeurIPS.

[43]  Michael S. Bernstein,et al.  HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models , 2019, NeurIPS.

[44]  Karen Simonyan,et al.  Hierarchical Autoregressive Image Models with Auxiliary Decoders , 2019, ArXiv.