There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average

Presently, the most successful approaches to semi-supervised learning are based on consistency regularization, whereby a model is trained to be robust to small perturbations of its inputs and parameters. To understand consistency regularization, we conceptually explore how loss geometry interacts with training procedures. The consistency loss dramatically improves generalization performance over supervised-only training; however, we show that SGD struggles to converge on the consistency loss and continues to make large steps that lead to changes in predictions on the test data. Motivated by these observations, we propose to train consistency-based methods with Stochastic Weight Averaging (SWA), a recent approach that averages weights along the trajectory of SGD with a modified learning rate schedule. We also propose fast-SWA, which further accelerates convergence by averaging multiple points within each cycle of a cyclical learning rate schedule. With weight averaging, we achieve the best known semi-supervised results on CIFAR-10 and CIFAR-100, across many different quantities of labeled training data. For example, we achieve 5.0% error on CIFAR-10 with only 4000 labels, compared to the previous best result in the literature of 6.3%.
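To make the averaging scheme concrete, the sketch below is a minimal PyTorch illustration of fast-SWA-style weight averaging under a cyclical learning rate: within each cycle, several SGD iterates are folded into a running mean of the weights, rather than keeping only one iterate per cycle. The toy model, placeholder loss, cycle length, and averaging cadence are assumptions made for this example and are not taken from the paper.

```python
import copy
import torch
import torch.nn as nn

# Minimal sketch of fast-SWA-style weight averaging with a cyclical learning rate.
# The model, data, loss, cycle length, and averaging cadence are placeholder
# choices for illustration, not the paper's exact configuration.

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

swa_model = copy.deepcopy(model)  # running average of SGD iterates
n_averaged = 0

CYCLE_LEN = 1000                  # steps per learning-rate cycle (assumed value)
LR_MAX, LR_MIN = 0.1, 0.001

def cyclical_lr(step):
    """Linearly decay from LR_MAX to LR_MIN within each cycle, then restart."""
    t = (step % CYCLE_LEN) / CYCLE_LEN
    return (1 - t) * LR_MAX + t * LR_MIN

@torch.no_grad()
def update_average(avg_model, current_model, n):
    """Incremental mean of weights: avg <- avg + (w - avg) / (n + 1)."""
    for p_avg, p in zip(avg_model.parameters(), current_model.parameters()):
        p_avg.add_((p - p_avg) / (n + 1))

for step in range(10 * CYCLE_LEN):
    for group in optimizer.param_groups:
        group["lr"] = cyclical_lr(step)

    x = torch.randn(16, 32)        # placeholder batch
    loss = model(x).pow(2).mean()  # placeholder objective; the paper's training
                                   # loss adds a consistency term on unlabeled data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # fast-SWA: average several weight snapshots within each cycle (here, every
    # 100 steps during the low-learning-rate half), rather than one per cycle.
    if step % 100 == 0 and (step % CYCLE_LEN) >= CYCLE_LEN // 2:
        update_average(swa_model, model, n_averaged)
        n_averaged += 1
```

At evaluation time, the averaged weights in `swa_model` would be used in place of the final SGD iterate; for networks with batch normalization, the running statistics would need to be recomputed under the averaged weights, as in standard SWA. Recent PyTorch releases also provide `torch.optim.swa_utils.AveragedModel` and `SWALR`, which implement the same running-average bookkeeping.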
