Zero-shot Knowledge Transfer via Adversarial Belief Matching

Performing knowledge transfer from a large teacher network to a smaller student is a common task in modern deep learning applications. However, due to growing dataset sizes and stricter privacy regulations, it is increasingly common not to have access to the data that was used to train the teacher. We propose a novel method that trains a student to match the predictions of its teacher without using any data or metadata. We achieve this by training an adversarial generator to search for images on which the student poorly matches the teacher, and then using them to train the student. Our resulting student closely approximates its teacher on simple datasets like SVHN, and on CIFAR10 we improve on the state of the art for few-shot distillation (with 100 images per class), despite using no data. Finally, we propose a metric to quantify the degree of belief matching between teacher and student in the vicinity of decision boundaries, and observe a significantly higher match between our zero-shot student and the teacher than between the teacher and a student distilled with real data. Code is available at: this https URL
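The training procedure alternates between a generator step that searches for disagreement and student steps that resolve it. Below is a minimal PyTorch sketch of this alternating scheme, assuming pre-built `generator`, `teacher`, and `student` modules; the `kl_teacher_student` helper, the temperature `T`, and the step counts `n_g`/`n_s` are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def kl_teacher_student(t_logits, s_logits, T=1.0):
    """KL(teacher || student) between temperature-softened class distributions."""
    t_prob = F.softmax(t_logits / T, dim=1)
    s_logprob = F.log_softmax(s_logits / T, dim=1)
    return F.kl_div(s_logprob, t_prob, reduction="batchmean")

def zero_shot_step(generator, teacher, student, g_opt, s_opt,
                   z_dim=100, batch_size=128, n_g=1, n_s=10, device="cpu"):
    """One outer iteration of zero-shot knowledge transfer (sketch)."""
    # Generator ascent: produce pseudo-images where the student
    # disagrees most with the (frozen) teacher.
    for _ in range(n_g):
        z = torch.randn(batch_size, z_dim, device=device)
        x = generator(z)
        g_loss = -kl_teacher_student(teacher(x), student(x))
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
    # Student descent: match the teacher's beliefs on freshly sampled images.
    for _ in range(n_s):
        z = torch.randn(batch_size, z_dim, device=device)
        with torch.no_grad():
            x = generator(z)
            t_logits = teacher(x)
        s_loss = kl_teacher_student(t_logits, student(x))
        s_opt.zero_grad()
        s_loss.backward()
        s_opt.step()
```

Taking several student steps per generator step keeps the student from falling behind an adversary that can always find new points of disagreement; the 1:10 ratio here is an assumption, not a prescribed setting.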

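To make the boundary-matching metric concrete, here is a hedged sketch of one way to probe beliefs near decision boundaries: perturb a test image toward a target class using gradients from one network, and record the gap between both networks' probabilities for that class along the path. The step size `xi`, the number of steps, and the averaged absolute gap are illustrative choices and may differ from the exact metric used in the paper.

```python
import torch
import torch.nn.functional as F

def transition_error(net_a, net_b, x0, target_class, xi=1.0, steps=100):
    """Drive x0 toward net_a's decision boundary for target_class and
    average the gap between the two networks' beliefs along the path."""
    x = x0.clone().requires_grad_(True)
    target = torch.tensor([target_class], device=x0.device)
    gaps = []
    for _ in range(steps):
        loss = F.cross_entropy(net_a(x), target)
        grad, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x -= xi * grad  # step that raises net_a's belief in target_class
            p_a = F.softmax(net_a(x), dim=1)[0, target_class]
            p_b = F.softmax(net_b(x), dim=1)[0, target_class]
            gaps.append((p_a - p_b).abs().item())
    return sum(gaps) / steps
```

A small returned value means the second network's beliefs track the first's all the way to the boundary, which is the sense in which the zero-shot student is said to match its teacher more closely than a student distilled with real data.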