Sample Selection Bias Correction Theory

This paper presents a theoretical analysis of sample selection bias correction. The sample bias correction technique commonly used in machine learning consists of reweighting the cost of an error on each training point of a biased sample to more closely reflect the unbiased distribution. The technique relies on weights derived from finite samples by various estimation methods. We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm, for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability, which generalizes the existing concept of point-based stability. Much of our analysis and many of our proof techniques can be used to study other importance-weighting techniques and their effect on accuracy when used with a distributionally stable algorithm.
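To make the reweighting scheme concrete, the following is a minimal illustrative sketch in Python, not the exact procedure analyzed in the paper. It assumes NumPy and scikit-learn are available, estimates per-point importance weights with a simple cluster-based frequency ratio between an unlabeled unbiased sample and the biased training sample, and then passes those weights as per-example costs to a standard learner. The helper name cluster_importance_weights and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cluster_importance_weights(X_biased, X_unbiased, n_clusters=10, seed=0):
    """Estimate w(x) ~ Pr_unbiased(cluster(x)) / Pr_biased(cluster(x)).

    A crude cluster-based estimate of the importance weights; the paper's
    analysis bounds how errors in such finite-sample estimates propagate
    to the hypothesis returned by a distributionally stable algorithm.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(np.vstack([X_biased, X_unbiased]))

    biased_counts = np.bincount(km.predict(X_biased), minlength=n_clusters)
    unbiased_counts = np.bincount(km.predict(X_unbiased), minlength=n_clusters)

    # Per-cluster frequency ratio, lightly smoothed to avoid division by zero,
    # and rescaled so the two sample sizes do not distort the weights.
    ratio = (unbiased_counts + 1e-6) / (biased_counts + 1e-6)
    ratio *= biased_counts.sum() / unbiased_counts.sum()
    return ratio[km.predict(X_biased)]

# Illustrative usage with synthetic data: selection bias depends on feature 0,
# labels are observed only on the biased sample.
rng = np.random.default_rng(0)
X_unbiased = rng.normal(size=(1000, 5))                 # unlabeled, unbiased sample
keep = rng.random(1000) < 1.0 / (1.0 + np.exp(-X_unbiased[:, 0]))
X_biased = X_unbiased[keep]
y_biased = (X_biased[:, 1] > 0).astype(int)

w = cluster_importance_weights(X_biased, X_unbiased)
clf = LogisticRegression().fit(X_biased, y_biased, sample_weight=w)
```

The same per-example weights could instead be obtained by kernel mean matching, which fits the weights so that the weighted mean of the biased sample matches the mean of the unbiased sample in a reproducing kernel Hilbert space; the estimation error analyzed in the paper is exactly the discrepancy between such estimated weights and the ideal ones.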
