Unknown Examples & Machine Learning Model Generalization

Over the past decades, researchers and ML practitioners have developed increasingly effective ways to build, understand, and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the test data. In many real-world applications, however, some potential training examples are unknown to the modeler, due to sample selection bias or, more generally, covariate shift, i.e., a distribution shift between the training and deployment stages. The resulting discrepancy between the training and test distributions leads to poor generalization performance of the ML model and hence to biased predictions. We provide novel algorithms that estimate the number and properties of these missing training examples, the so-called unknown unknowns. This information can then be used to correct the training set before any test data is seen. The key idea is to combine species-estimation techniques with data-driven methods for estimating the feature values of the unknown unknowns. Experiments on a variety of ML models and datasets indicate that taking the unknown unknowns into account can yield a more robust ML model that generalizes better.
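To make the species-estimation half of this idea concrete, the sketch below applies the classic bias-corrected Chao1 estimator, treating discretized feature-space cells as "species" and estimating how many cells a biased training sample never observed. This is a minimal illustration of the counting step only, not the paper's actual algorithm; the function name chao1_unseen, the binning scheme, and the cell width of 0.25 are illustrative assumptions.

    # Illustrative sketch only, NOT the paper's actual algorithm.
    from collections import Counter
    import numpy as np

    def chao1_unseen(samples):
        """Bias-corrected Chao1 estimate of the number of distinct 'species'
        (here: feature-space cells) that the sample never observed."""
        counts = Counter(samples)
        f1 = sum(1 for c in counts.values() if c == 1)  # cells seen exactly once
        f2 = sum(1 for c in counts.values() if c == 2)  # cells seen exactly twice
        return f1 * (f1 - 1) / (2.0 * (f2 + 1))

    rng = np.random.default_rng(0)
    train_feature = rng.normal(size=500)                 # a (possibly biased) training sample
    cells = np.floor(train_feature / 0.25).astype(int)   # coarse binning; width is an assumption
    print(f"observed cells: {len(set(cells))}, "
          f"estimated unseen cells: {chao1_unseen(cells):.1f}")

In the setting described above, such an estimate of the unseen portion of the training distribution would then be paired with a data-driven method for imputing the feature values of the missing examples, e.g., sampling from a density fit near under-represented regions, so that the training set can be corrected before deployment.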
