Local Case-Control Sampling: Efficient Subsampling in Imbalanced Data Sets

For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to 1 + 1/c if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly (1 + c)/2 times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
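The scheme described above can be sketched compactly. The following is a minimal, illustrative implementation under the abstract's description, not the authors' reference code: given a pilot coefficient vector, each point (x, y) is accepted with probability |y − p̃(x)|, where p̃ is the pilot's predicted probability (so conditionally rare responses are favored); a logistic model is fit to the accepted points; and the pilot is added back as the post-hoc analytic adjustment. The Newton-Raphson fitter and the function names are ours, and the baseline case c = 1 is assumed (no weighting needed).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson MLE for logistic regression; X includes an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                       # score
        W = p * (1.0 - p)                          # IRLS weights
        H = X.T @ (X * W[:, None]) + 1e-8 * np.eye(X.shape[1])
        theta += np.linalg.solve(H, grad)
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """One scan over the data: accept (x, y) w.p. |y - p_pilot(x)|,
    fit on the subsample, then add the pilot back (post-hoc adjustment)."""
    p_tilde = sigmoid(X @ theta_pilot)
    accept = rng.uniform(size=len(y)) < np.abs(y - p_tilde)
    theta_s = fit_logistic(X[accept], y[accept])
    return theta_s + theta_pilot
```

As a usage sketch, the pilot can itself come from a cheap fit on a small uniform subsample; on imbalanced data the accept step then keeps most of the rare class while discarding the bulk of the easy majority-class points.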
