Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles

When a deep learning model is deployed in the wild, it can encounter test data drawn from distributions different from the training data distribution and suffer a drop in performance. For safe deployment, it is essential to estimate the accuracy of the pre-trained model on the test data. However, the labels for the test inputs are usually not immediately available in practice, and obtaining them can be expensive. This observation leads to two challenging tasks: (1) unsupervised accuracy estimation, which aims to estimate the accuracy of a pre-trained classifier on a set of unlabeled test inputs; (2) error detection, which aims to identify misclassified test inputs. In this paper, we propose a principled and practically effective framework that simultaneously addresses the two tasks. The proposed framework iteratively learns an ensemble of models to identify misclassified data points and performs self-training to improve the ensemble with the identified points. Theoretical analysis demonstrates that our framework enjoys provable guarantees for both accuracy estimation and error detection under mild conditions readily satisfied by practical deep learning models. Along with the framework, we propose and experiment with two instantiations, achieving state-of-the-art results on 59 tasks. For example, on iWildCam, one instantiation reduces the estimation error for unsupervised accuracy estimation by at least 70% and improves the F1 score for error detection by at least 4.7% compared to existing methods. Our code is available at: https://github.com/jfc43/self-training-ensembles.
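The iterative loop the abstract describes can be sketched in a few lines. The following is a minimal illustration under assumed interfaces: `pretrained_predict` stands for the classifier being evaluated and `train_model` for any routine that fits one ensemble member on pseudo-labeled test inputs; the disagreement rule and pseudo-label update are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of a self-training-ensemble loop for unsupervised accuracy
# estimation and error detection. Helper names and the update rule are
# illustrative assumptions, not the paper's reference implementation.
import numpy as np

def self_training_ensemble(pretrained_predict, train_model, x_test,
                           n_rounds=3, n_members=5):
    """Return an unsupervised accuracy estimate for the pre-trained classifier
    on unlabeled x_test, plus a boolean mask of suspected misclassifications."""
    f_preds = np.asarray(pretrained_predict(x_test))  # labels from the classifier under evaluation
    pseudo = f_preds.copy()                           # initial pseudo-labels for self-training
    suspected_errors = np.zeros(len(f_preds), dtype=bool)

    for _ in range(n_rounds):
        # Train an ensemble on the unlabeled test inputs using current pseudo-labels.
        members = [train_model(x_test, pseudo) for _ in range(n_members)]
        member_preds = np.stack([np.asarray(m.predict(x_test)) for m in members])

        # Majority vote across ensemble members for each test input.
        votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                                    axis=0, arr=member_preds)

        # Inputs where the ensemble disagrees with the pre-trained classifier
        # are flagged as suspected errors.
        suspected_errors = votes != f_preds

        # Self-training step: adopt the ensemble's labels as pseudo-labels for
        # the next round, so the ensemble keeps improving on the flagged points.
        pseudo = votes

    est_accuracy = 1.0 - suspected_errors.mean()  # estimated fraction predicted correctly
    return est_accuracy, suspected_errors
```

In this reading, the estimated accuracy is simply one minus the fraction of test inputs on which the ensemble and the pre-trained classifier disagree, and the disagreement mask doubles as the error-detection output.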
