Correcting sample selection bias in maximum entropy density estimation

We study the problem of maximum entropy density estimation in the presence of known sample selection bias. We propose three bias correction approaches. The first one takes advantage of unbiased sufficient statistics which can be obtained from biased samples. The second one estimates the biased distribution and then factors the bias out. The third one approximates the second by only using samples from the sampling distribution. We provide guarantees for the first two approaches and evaluate the performance of all three approaches in synthetic experiments and on real data from species habitat modeling, where maxent has been successfully applied and where sample selection bias is a significant problem.
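To make the second approach concrete, the following is a minimal sketch and not the paper's implementation. It assumes a finite domain, a known sampling distribution `s` with strictly positive entries, and an unregularized maxent fit by plain gradient ascent. The function names (`normalize`, `gibbs`, `fit_maxent`, `factor_bias_out`) and the toy setup are hypothetical, chosen only for illustration. The idea: samples drawn under selection bias follow a distribution proportional to the product of the target density and `s`, so one can fit that biased distribution first and then divide `s` back out.

```python
import math

def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def gibbs(features, lam):
    """Gibbs distribution q(x) proportional to exp(lam . f(x)) on a finite domain."""
    scores = [sum(l * f for l, f in zip(lam, fx)) for fx in features]
    top = max(scores)  # subtract the maximum for numerical stability
    return normalize([math.exp(sc - top) for sc in scores])

def fit_maxent(features, target_means, steps=2000, lr=1.0):
    """Unregularized maxent fit (illustrative assumption): gradient ascent on
    the log-likelihood, whose gradient is target means minus model means."""
    lam = [0.0] * len(target_means)
    for _ in range(steps):
        q = gibbs(features, lam)
        model_means = [
            sum(qx * fx[j] for qx, fx in zip(q, features))
            for j in range(len(lam))
        ]
        lam = [l + lr * (t - m) for l, t, m in zip(lam, target_means, model_means)]
    return gibbs(features, lam)

def factor_bias_out(q_biased, s):
    """Divide the estimated biased distribution by the known sampling
    distribution s (assumed strictly positive) and renormalize."""
    return normalize([q / sx for q, sx in zip(q_biased, s)])

if __name__ == "__main__":
    # Toy domain of four sites with a single feature scaled to [0, 1].
    features = [[x / 3.0] for x in range(4)]
    s = normalize([math.exp(-0.5 * x) for x in range(4)])     # known bias
    target = normalize([math.exp(1.0 * x) for x in range(4)])  # unknown in practice
    biased = normalize([p * sx for p, sx in zip(target, s)])
    # In practice these means would come from the biased samples.
    means = [sum(b * fx[0] for b, fx in zip(biased, features))]
    estimate = factor_bias_out(fit_maxent(features, means), s)
    print(estimate)
```

In this toy case the biased distribution is itself a Gibbs distribution in the chosen feature, so dividing out `s` recovers the target; with real, finite samples the fit would only approximate it, and some form of regularization would normally be needed.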
