Sample selection bias is a common problem encountered when applying data mining algorithms to many real-world applications. Traditionally, it is assumed that training and test data are sampled from the same probability distribution, the so-called "stationary or non-biased distribution assumption." In reality, however, this assumption is often violated. Typical examples include marketing solicitation, fraud detection, drug testing, loan approval, and school enrollment. For these applications, the only labeled data available for training is, in various ways, a biased representation of the future data on which the inductive model will predict. Intuitively, some examples sampled frequently into the training data may actually be infrequent in the testing data, and vice versa. When this happens, an inductive model constructed from the biased training set may not be as accurate on unbiased testing data as it would have been had there been no selection bias in the training data. In this paper, we first improve and clarify a previously proposed categorization of sample selection bias. In particular, we show that, except under very restricted conditions, sample selection bias is a common problem in many real-world situations. We then analyze various effects of sample selection bias on inductive modeling; in particular, how the "true" conditional probability P(y|x) to be modeled by inductive learners can be misrepresented in the biased training data, which subsequently misleads a learning algorithm. To address the loss of accuracy caused by sample selection bias, we explore how model averaging of (1) conditional probabilities P(y|x), (2) feature probabilities P(x), and (3) joint probabilities P(x, y) can reduce the influence of sample selection bias on model accuracy. In particular, we explore how to use unlabeled data in a semi-supervised learning framework to improve the accuracy of predictive models constructed from biased training samples.
IBM T.J. Watson Research Center, Hawthorne, NY 10532, weifan@us.ibm.com
Department of Computer Science, University at Albany, State University of New York, Albany, NY 12222, davidson@cs.albany.edu
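To make the feature-bias scenario concrete, below is a minimal sketch, not from the paper itself: training examples are drawn with a selection probability P(s=1|x) that depends on x, so the training marginal P(x) is distorted while P(y|x) is unchanged, and the conditional-probability estimates P(y|x) of several classifiers trained on bootstrap resamples are averaged. All distributions, parameters, and helper names here are illustrative assumptions, and numpy/scikit-learn are assumed available.

```python
# Illustrative sketch of feature bias and model averaging of P(y|x).
# Not the authors' implementation; all parameters are arbitrary choices.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def true_conditional(x):
    """The 'true' P(y=1 | x), unchanged by selection bias."""
    return 1.0 / (1.0 + np.exp(-3.0 * x))

# Unbiased population.
x_all = rng.normal(0.0, 1.0, size=20000)
y_all = (rng.random(x_all.shape) < true_conditional(x_all)).astype(int)

# Feature bias: examples with large x are far more likely to be selected
# into training, so training P(x) differs from the test-time P(x).
select_prob = 1.0 / (1.0 + np.exp(-2.0 * x_all))   # P(s=1 | x)
selected = rng.random(x_all.shape) < select_prob
x_train, y_train = x_all[selected].reshape(-1, 1), y_all[selected]

# Held-out unbiased test set.
x_test = rng.normal(0.0, 1.0, size=5000)
y_test = (rng.random(x_test.shape) < true_conditional(x_test)).astype(int)
x_test = x_test.reshape(-1, 1)

# Single model trained on the biased sample.
single = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
single.fit(x_train, y_train)

# Model averaging of P(y|x): average the probability estimates of trees
# trained on bootstrap resamples of the same biased training data.
n_models = 30
probs = np.zeros(len(x_test))
for i in range(n_models):
    idx = rng.integers(0, len(x_train), size=len(x_train))
    tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=i)
    tree.fit(x_train[idx], y_train[idx])
    probs += tree.predict_proba(x_test)[:, 1]
avg_pred = (probs / n_models >= 0.5).astype(int)

print("single model accuracy  :", (single.predict(x_test) == y_test).mean())
print("averaged P(y|x) accuracy:", (avg_pred == y_test).mean())
```

Averaging P(y|x) estimates in this way is plain bagging, which mainly damps the variance-driven errors in regions the biased sample covers thinly; the paper's other two schemes, averaging P(x) and P(x, y), would additionally require density estimates over the feature space and are not sketched here.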
[1] Ji Zhu et al. A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning. NIPS, 2004.
[2] Philip S. Yu et al. An improved categorization of classifier's sensitivity on sample selection bias. Fifth IEEE International Conference on Data Mining (ICDM'05), 2005.
[3] Charles Elkan et al. A Bayesian network framework for reject inference. KDD, 2004.
[4] Nicole A. Lazar et al. Statistical Analysis With Missing Data. Technometrics, 2003.
[5] Ian Davidson et al. When Efficient Model Averaging Out-Performs Boosting and Bagging. PKDD, 2006.
[6] R. Kothari et al. Learning from labeled and unlabeled data. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), 2002.
[7] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. ICML, 2004.
[8] Zoran Obradovic et al. Exploiting unlabeled data for improving accuracy of predictive data mining. Third IEEE International Conference on Data Mining, 2003.
[9] Bernhard Schölkopf et al. Correcting Sample Selection Bias by Unlabeled Data. NIPS, 2006.
[10] Vladimir Vapnik. The Nature of Statistical Learning Theory. 1995.
[11] Massih-Reza Amini et al. Learning Classification with Both Labeled and Unlabeled Data. ECML, 2002.
[12] J. Heckman. Sample selection bias as a specification error. Econometrica, 1979.
[13] Nitesh V. Chawla et al. Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains. J. Artif. Intell. Res., 2005.
[14] Ian Davidson et al. Reverse testing: an efficient framework to select amongst classifiers under sample selection bias. KDD, 2006.