Self-training Converts Weak Learners to Strong Learners in Mixture Models

We consider a binary classification problem in which the data comes from a mixture of two rotationally symmetric distributions satisfying concentration and anti-concentration properties enjoyed by log-concave distributions, among others. We show that there exists a universal constant C_err > 0 such that if a pseudolabeler β_pl achieves classification error at most C_err, then for any ε > 0, an iterative self-training algorithm initialized at β_0 := β_pl, which at each step t generates pseudolabels ŷ = sgn(〈β_t, x〉), learns the Bayes-optimal classifier up to ε error using at most Õ(d/ε) unlabeled examples, where d is the ambient dimension. That is, self-training converts weak learners to strong learners using only unlabeled examples. We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler β_pl with classification error C_err using only O(d) labeled examples (i.e., independent of ε). Together, our results imply that mixture models can be learned to within ε of the Bayes-optimal accuracy using at most O(d) labeled examples and Õ(d/ε) unlabeled examples via a semi-supervised self-training algorithm.
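For intuition, the following is a minimal Python sketch of the kind of iterative self-training loop described above. The abstract does not specify the per-iteration update, so the choice of a logistic-loss gradient step on the pseudolabeled data, the projection onto the unit sphere, the step size, and the iteration count are all assumptions; the function names are hypothetical.

```python
import numpy as np

def self_train(X, beta_pl, n_iters=100, lr=0.1):
    """Iterative self-training from a weak pseudolabeler (illustrative sketch).

    X       : (n, d) array of unlabeled examples.
    beta_pl : (d,) initial pseudolabeler direction with small classification error.
    """
    beta = beta_pl / np.linalg.norm(beta_pl)  # assumed normalization
    for _ in range(n_iters):
        # Pseudolabel with the current iterate: y_hat = sgn(<beta_t, x>).
        y_hat = np.sign(X @ beta)
        y_hat[y_hat == 0] = 1.0  # break ties arbitrarily
        # One gradient step on the average logistic loss over the pseudolabels
        # (the paper's exact update rule may differ; this is an assumption).
        margins = y_hat * (X @ beta)
        weights = y_hat / (1.0 + np.exp(margins))       # per-sample gradient weight
        grad = -(X * weights[:, None]).mean(axis=0)     # average logistic-loss gradient
        beta = beta - lr * grad
        beta /= np.linalg.norm(beta)  # project back to the unit sphere (assumption)
    return beta

# Toy usage on a symmetric two-component Gaussian mixture (illustrative only):
rng = np.random.default_rng(0)
d, n = 20, 5000
mu = np.ones(d) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + rng.normal(size=(n, d))
beta_weak = mu + 0.5 * rng.normal(size=d)   # a crude pseudolabeler playing the role of β_pl
beta_hat = self_train(X, beta_weak)
```

In this sketch the pseudolabels are regenerated from the current iterate at every step, which is the mechanism the abstract attributes to self-training; only unlabeled examples are touched inside the loop.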
