Cost-Sensitive Self-Training for Optimizing Non-Decomposable Metrics

Self-training-based semi-supervised learning algorithms have enabled the training of highly accurate deep neural networks using only a fraction of labeled data. However, most work on self-training has focused on improving accuracy, whereas practical machine learning systems often have complex goals (e.g., maximizing the minimum recall across classes) that are non-decomposable in nature. In this work, we introduce the Cost-Sensitive Self-Training (CSST) framework, which generalizes self-training-based methods to optimize non-decomposable metrics. We prove that our framework can better optimize the desired non-decomposable metric using unlabeled data, under data distribution assumptions similar to those made in the analysis of self-training. Using the proposed CSST framework, we obtain practical self-training methods (for both vision and NLP tasks) for optimizing different non-decomposable metrics with deep neural networks. Our results demonstrate that CSST improves over the state-of-the-art in the majority of cases across datasets and objectives.
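To make the notion of a non-decomposable metric concrete, below is a minimal sketch (not taken from the paper) of the min-recall objective mentioned above. The function name `min_recall` and the toy data are illustrative assumptions; the point is that, unlike accuracy, this metric couples all examples of each class and therefore cannot be written as an average of per-example losses.

```python
# Illustrative sketch: minimum per-class recall, a non-decomposable metric.
import numpy as np

def min_recall(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> float:
    """Return the minimum of the per-class recalls."""
    recalls = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.sum() == 0:
            continue  # skip classes absent from this evaluation set
        recalls.append((y_pred[mask] == c).mean())
    return float(min(recalls))

# Example: a classifier with 90% overall accuracy can still have a
# min-recall of 0 if it ignores a rare class.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)      # always predicts the majority class
print(min_recall(y_true, y_pred, 2))   # -> 0.0, despite 90% accuracy
```

Optimizing such an objective requires reweighting or constraining the per-class errors (a cost-sensitive formulation), which is what CSST carries over to the self-training setting.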
