Local Case-Control Sampling: Efficient Subsampling in Imbalanced Data Sets

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to 1 + 1/c if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly (1 + c)/2 times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
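The scheme described above can be sketched compactly. The following is a minimal, illustrative implementation under the abstract's description, not the authors' reference code: given a pilot coefficient vector, each point (x, y) is accepted with probability |y − p̃(x)|, where p̃ is the pilot's predicted probability (so conditionally rare responses are favored); a logistic model is fit to the accepted points; and the pilot is added back as the post-hoc analytic adjustment. The Newton-Raphson fitter and the function names are ours, and the baseline case c = 1 is assumed (no weighting needed).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson MLE for logistic regression; X includes an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                       # score
        W = p * (1.0 - p)                          # IRLS weights
        H = X.T @ (X * W[:, None]) + 1e-8 * np.eye(X.shape[1])
        theta += np.linalg.solve(H, grad)
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One scan over the data: accept (x, y) w.p. |y - p_pilot(x)|,
    fit on the subsample, then add the pilot back (post-hoc adjustment)."""
    p_tilde = sigmoid(X @ theta_pilot)
    accept = rng.uniform(size=len(y)) < np.abs(y - p_tilde)
    theta_s = fit_logistic(X[accept], y[accept])
    return theta_s + theta_pilot
```

As a usage sketch, the pilot can itself come from a cheap fit on a small uniform subsample; on imbalanced data the accept step then keeps most of the rare class while discarding the bulk of the easy majority-class points.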
