Estimating and Explaining Model Performance When Both Covariates and Labels Shift

Deployed machine learning (ML) models often encounter new user data that differs from their training data. Estimating how well a given model will perform on such new data is therefore an important step toward reliable ML applications. This is challenging, however, because the data distribution can change in flexible ways and the new data often carries no labels, as is typical in monitoring settings. In this paper, we propose a new distribution shift model, Sparse Joint Shift (SJS), which considers the joint shift of both labels and a few features. SJS unifies and generalizes several existing shift models, including label shift and sparse covariate shift, which consider only shifts in the marginal label or feature distribution. We describe mathematical conditions under which SJS is identifiable. We further propose SEES, an algorithmic framework for characterizing the distribution shift under SJS and estimating a model's performance on new data without any labels. We conduct extensive experiments on several real-world datasets with various ML models. Across different datasets and distribution shifts, SEES reduces shift estimation error by up to an order of magnitude compared with existing approaches.
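
To make the idea of label-free performance estimation concrete, the sketch below handles only the pure label-shift corner case of SJS, where P(Y) changes but P(X | Y) stays fixed; it is an illustration of the general reweighting principle, not the SEES algorithm from the paper. It recovers the shifted label marginal from the model's own predictions via the well-known black-box confusion-matrix approach and then reweights per-class source accuracy. The function name, the sklearn-style `model.predict` interface, and the assumption of integer class labels 0..n_classes-1 are all illustrative assumptions.

```python
import numpy as np

def estimate_target_accuracy_under_label_shift(model, X_src, y_src, X_tgt, n_classes):
    """Estimate a classifier's accuracy on unlabeled target data, assuming pure
    label shift: P(Y) changes but P(X | Y) is unchanged (a special case of SJS).

    Illustrative sketch only; assumes integer class labels in {0, ..., n_classes-1}
    and a model exposing a predict() method.
    """
    preds_src = np.asarray(model.predict(X_src))
    y_src = np.asarray(y_src)

    # Confusion matrix C[i, j] = P_src(model predicts i, true label is j),
    # estimated on held-out labeled source data.
    C = np.zeros((n_classes, n_classes))
    for p, y in zip(preds_src, y_src):
        C[p, y] += 1.0
    C /= len(y_src)

    # Marginal distribution of the model's predictions on the unlabeled target data.
    preds_tgt = np.asarray(model.predict(X_tgt))
    mu_tgt = np.bincount(preds_tgt, minlength=n_classes) / len(preds_tgt)

    # Under label shift, C @ q = mu_tgt, where q is the target label marginal.
    # Solve for q and project back onto the simplex for numerical stability.
    q_tgt, *_ = np.linalg.lstsq(C, mu_tgt, rcond=None)
    q_tgt = np.clip(q_tgt, 0.0, None)
    q_tgt /= q_tgt.sum()

    # Per-class accuracy is invariant under label shift (P(X | Y) is fixed),
    # so target accuracy is the per-class source accuracy reweighted by q.
    acc_per_class = np.array([
        (preds_src[y_src == c] == c).mean() if (y_src == c).any() else 0.0
        for c in range(n_classes)
    ])
    return float(np.dot(q_tgt, acc_per_class))
```

Under SJS more generally, the same reweighting idea would apply to the joint distribution of the label and the small set of shifted features rather than the label alone; this sketch covers only the label-shift special case.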
