Few-shot learning of neural networks from scratch by pseudo example optimization

In this paper, we propose a simple but effective method for training neural networks with a limited amount of training data. Our approach inherits the idea of knowledge distillation, which transfers knowledge from a deep or wide reference model to a shallow or narrow target model. The proposed method employs this idea to mimic the predictions of reference estimators that are more robust against overfitting than the network we want to train. Unlike almost all previous work on knowledge distillation, which requires a large amount of labeled training data, the proposed method needs only a small amount. Instead, we introduce pseudo training examples that are optimized as a part of the model parameters. Experimental results on several benchmark datasets demonstrate that the proposed method outperforms all the other baselines, such as naive training of the target model and standard knowledge distillation.
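The core idea, a robust teacher fit on the few real examples plus pseudo inputs optimized jointly with the student, can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' implementation: kernel ridge regression stands in for the robust reference estimator, the student is a linear model so its gradients are closed-form, and the pseudo inputs are moved by gradient ascent on the student-teacher discrepancy (one plausible reading of "optimized as a part of model parameters").

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Teacher: a robust reference estimator fit on a few real examples ------
# (Kernel ridge regression is an illustrative stand-in; the choice of
# teacher, data, and all hyperparameters here are assumptions.)
X_real = np.array([-1.0, 0.0, 1.0])
y_real = np.array([-1.0, 0.0, 1.0])

def rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

alpha = np.linalg.solve(rbf(X_real, X_real) + 1e-3 * np.eye(3), y_real)

def teacher(x):
    return rbf(np.atleast_1d(x), X_real) @ alpha

def teacher_grad(x, eps=1e-4):          # finite-difference derivative
    return (teacher(x + eps) - teacher(x - eps)) / (2 * eps)

# --- Student and pseudo examples, trained jointly --------------------------
w, b = 0.0, 0.0                          # linear student, for simple gradients
z = rng.uniform(-2.0, 2.0, size=20)      # pseudo inputs, treated as parameters

for _ in range(500):
    r = (w * z + b) - teacher(z)         # student-teacher residual
    # descend on the student weights: imitate the teacher on pseudo points
    w -= 0.05 * np.mean(2 * r * z)
    b -= 0.05 * np.mean(2 * r)
    # ascend on the pseudo inputs: push them toward regions of disagreement
    z += 0.01 * 2 * r * (w - teacher_grad(z)) / len(z)
    z = np.clip(z, -3.0, 3.0)

# Student-teacher discrepancy on a held-out grid after distillation
grid = np.linspace(-2.0, 2.0, 101)
err = np.mean(((w * grid + b) - teacher(grid)) ** 2)
```

The only labeled data the student ever sees are the teacher's predictions at the pseudo inputs; the three real examples are consumed solely by the teacher fit, which is what makes the few-shot setting workable in this sketch.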
