Learning to Validate the Predictions of Black Box Classifiers on Unseen Data

Machine Learning (ML) models are difficult to maintain in production settings. In particular, deviations of the unseen serving data (for which we want to compute predictions) from the source data (on which the model was trained) pose a central challenge, especially when model training and prediction are outsourced via cloud services. Errors or shifts in the serving data can degrade a model's predictive quality, but are hard for engineers operating ML deployments to detect. We propose a simple approach to automate the validation of deployed ML models by estimating the model's predictive performance on unseen, unlabeled serving data. In contrast to existing work, we do not require explicit distributional assumptions about the dataset shift between the source and serving data. Instead, we rely on a programmatic specification of typical cases of dataset shift and data errors. We use this information to learn a performance predictor for a pretrained black box model that automatically raises an alarm when it detects performance drops on unseen serving data. We experimentally evaluate our approach on various datasets, models, and error types, and find that it reliably predicts the performance of black box models in the majority of cases and outperforms several baselines even in the presence of unspecified data errors.
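To make the approach concrete, the following is a minimal sketch in Python with scikit-learn. The particular error generators (inject_missing, scale_column), the percentile-based featurization of the black box model's confidence scores, the random forest regressor used as the performance predictor, and the alarm threshold are all illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def inject_missing(X, fraction):
    """Hypothetical error generator: zero out a fraction of cells,
    mimicking missing values that were filled with a default."""
    X = X.copy()
    X[rng.random(X.shape) < fraction] = 0.0
    return X

def scale_column(X, column, factor):
    """Hypothetical shift generator: rescale one feature,
    mimicking a unit change in the serving data."""
    X = X.copy()
    X[:, column] = X[:, column] * factor
    return X

def featurize(model, X):
    """Summarize the black box model's behavior on X via percentiles
    of its maximum predicted class probability (no labels needed)."""
    confidence = model.predict_proba(X).max(axis=1)
    return np.percentile(confidence, np.arange(0, 101, 10))

# A stand-in for the pretrained, outsourced black box model.
X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_held, X_serv, y_held, y_serv = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
black_box = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Programmatic specification of typical shifts and errors, applied to
# held-out labeled data to generate training examples for the predictor.
corruptions = [lambda X: X]  # include the clean case
corruptions += [lambda X, f=f: inject_missing(X, f) for f in (0.05, 0.1, 0.25, 0.5)]
corruptions += [lambda X, c=c, s=s: scale_column(X, c, s)
                for c in range(5) for s in (0.1, 10.0, 100.0)]

meta_features, meta_targets = [], []
for corrupt in corruptions:
    X_c = corrupt(X_held)
    meta_features.append(featurize(black_box, X_c))
    meta_targets.append((black_box.predict(X_c) == y_held).mean())  # true accuracy

performance_predictor = RandomForestRegressor(n_estimators=100, random_state=0)
performance_predictor.fit(np.array(meta_features), np.array(meta_targets))

# At serving time: estimate accuracy on unlabeled (possibly corrupted)
# data and raise an alarm if it falls below a tolerated threshold.
X_serving = inject_missing(X_serv, 0.3)
predicted_acc = performance_predictor.predict(
    featurize(black_box, X_serving).reshape(1, -1))[0]
if predicted_acc < 0.8:
    print(f"ALERT: predicted accuracy {predicted_acc:.3f} below threshold 0.8")
```

Featurizing only the model's output distribution keeps the performance predictor agnostic to the black box's internals; in a real deployment, the corruption library and the alarm threshold would be tuned to the application at hand.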
