Type-Independent Correction of Sample Selection Bias via Structural Discovery and Re-balancing

Sample selection bias is a common problem in many real-world applications, where training data are collected under practical constraints that make them follow a different distribution from the future test data. For example, in hospital clinical studies it is common practice to build models from eligible volunteers as the training data and then apply the models to the entire population. Because these volunteers are usually not selected at random, the training set may not be drawn from the same distribution as the test set; such a dataset suffers from “sample selection bias” or “covariate shift”. In the past few years, much work has been proposed to reduce sample selection bias, mainly by explicitly matching the distributions of the training and test sets. In this paper, we do not model the differing distributions directly. Instead, we propose to discover the natural structure of the target distribution, under which the different types of sample selection bias become evident and can then be reduced by generating a new sample set from that structure. In particular, unlabeled data are incorporated into the new sample set to strengthen the ability to minimize sample selection bias. One main advantage of the proposed approach is that it can correct all types of sample selection bias, whereas most previously proposed approaches are designed for specific types. In experimental studies, we simulate all three types of sample selection bias on 17 different classification problems, yielding 17×3 biased datasets on which the proposed algorithm is evaluated. The baseline models include decision tree, naive Bayes, nearest neighbor, and logistic regression. Across all combinations, the average increase in accuracy over the uncorrected sample set is 30% using each baseline model.
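The abstract does not spell out the structural-discovery step, but the general idea of discovering structure in the target distribution and re-balancing the biased sample against it can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes K-means as the clustering method and cluster-proportional resampling as the re-balancing rule, with the function name `rebalance` and all parameters chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def rebalance(X_labeled, y_labeled, X_unlabeled, n_clusters=5, seed=0):
    """Resample a biased labeled set so that its cluster proportions
    match those estimated from unlabeled target-distribution data."""
    rng = np.random.default_rng(seed)

    # Discover structure on labeled and unlabeled data jointly.
    X_all = np.vstack([X_labeled, X_unlabeled])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_all)
    lab_c = km.labels_[: len(X_labeled)]
    unl_c = km.labels_[len(X_labeled):]

    # Target share of each cluster, estimated from the unlabeled data.
    target = np.bincount(unl_c, minlength=n_clusters) / len(X_unlabeled)

    # Draw labeled examples per cluster in proportion to the target shares.
    n_out = len(X_labeled)
    idx = []
    for c in range(n_clusters):
        members = np.flatnonzero(lab_c == c)
        if len(members) == 0:
            continue  # no labeled examples available in this cluster
        n_c = int(round(target[c] * n_out))
        idx.extend(rng.choice(members, size=n_c, replace=True))
    idx = np.asarray(idx, dtype=int)
    return X_labeled[idx], y_labeled[idx]
```

A corrected training set produced this way can then be fed to any of the baseline learners (decision tree, naive Bayes, nearest neighbor, logistic regression), which is what makes the correction type-independent with respect to the classifier.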
