KDGAN: Knowledge Distillation with Generative Adversarial Networks

Knowledge distillation (KD) aims to train a lightweight classifier capable of providing accurate inference under constrained resources in multi-label learning. Instead of directly consuming feature-label pairs, the classifier is trained by a teacher, i.e., a high-capacity model whose training may be resource-hungry. The accuracy of a classifier trained this way is usually suboptimal because it is difficult to learn the true data distribution from the teacher. An alternative is to adversarially train the classifier against a discriminator in a two-player game akin to generative adversarial networks (GAN), which guarantees that the classifier learns the true data distribution at the equilibrium of the game. However, such a two-player game may take an excessively long time to reach equilibrium because of high-variance gradient updates. To address these limitations, we propose a three-player game named KDGAN, consisting of a classifier, a teacher, and a discriminator. The classifier and the teacher learn from each other via distillation losses and are adversarially trained against the discriminator via adversarial losses. By simultaneously optimizing the distillation and adversarial losses, the classifier learns the true data distribution at the equilibrium. We approximate the discrete distribution learned by the classifier (or the teacher) with a concrete distribution, from which we generate continuous samples to obtain low-variance gradient updates that speed up training. Extensive experiments on real datasets confirm the superiority of KDGAN in both accuracy and training speed.
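
As an illustration of the mechanics described above, the minimal PyTorch-style sketch below shows how continuous samples drawn from a concrete (Gumbel-Softmax) relaxation of the classifier's label distribution can carry low-variance gradients through an adversarial loss, alongside a soft-label distillation loss against the teacher. The function names, the temperature, and the specific loss forms (KL divergence for distillation, a non-saturating GAN loss for the adversarial term) are illustrative assumptions, not the exact formulation used in KDGAN.

import torch
import torch.nn.functional as F

def concrete_sample(logits, temperature=0.5):
    """Differentiable sample from the concrete (Gumbel-Softmax) relaxation
    of the categorical distribution given by `logits`. Because the sample
    is continuous, gradients flow back to the classifier with lower
    variance than REINFORCE-style estimates on discrete samples."""
    uniform = torch.rand_like(logits).clamp_(1e-10, 1.0)
    gumbel = -torch.log(-torch.log(uniform))           # Gumbel(0, 1) noise
    return F.softmax((logits + gumbel) / temperature, dim=-1)

def classifier_loss(cls_logits, tch_logits, disc_scores, alpha=0.5, tau=2.0):
    """Toy combination of a distillation loss (KL to the teacher's softened
    distribution) and an adversarial loss (fool the discriminator on a
    relaxed label sample). `alpha` and `tau` are illustrative hyperparameters."""
    distill = F.kl_div(
        F.log_softmax(cls_logits / tau, dim=-1),
        F.softmax(tch_logits / tau, dim=-1),
        reduction="batchmean",
    )
    adversarial = F.binary_cross_entropy_with_logits(
        disc_scores, torch.ones_like(disc_scores)      # non-saturating GAN loss
    )
    return alpha * distill + (1.0 - alpha) * adversarial

# Illustrative wiring: a stand-in discriminator scores relaxed label samples.
batch, num_labels = 4, 10
cls_logits = torch.randn(batch, num_labels, requires_grad=True)
tch_logits = torch.randn(batch, num_labels)
relaxed = concrete_sample(cls_logits)                  # continuous label sample
disc = torch.nn.Linear(num_labels, 1)                  # stand-in discriminator
loss = classifier_loss(cls_logits, tch_logits, disc(relaxed))
loss.backward()                                        # low-variance gradients reach cls_logits

The relaxed sample replaces a discrete label draw in the adversarial term, which is what lets gradient updates remain low-variance while the distillation term keeps the classifier close to the teacher.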
