Fitting logistic regression models with contaminated case-control data

Abstract. Responses are frequently observed with measurement error. When case–control data are based on reported responses, which may differ from the true responses, we have contaminated case–control data. In this paper, we first show that ordinary logistic regression analysis of contaminated case–control data can lead to seriously biased conclusions. This is supported by a theoretical argument, one example, and two simulation studies. We next derive the semiparametric maximum likelihood estimate (MLE) of the risk parameter of a logistic regression model when a validation subsample is available. We establish the asymptotic normality of the semiparametric MLE and give a consistent estimate of its asymptotic variance. Our example and two simulation studies show that these estimates perform reasonably well in finite samples.
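
The bias claim can be illustrated with a minimal simulation sketch. It is not the simulation design used in the paper, and it does not implement the semiparametric MLE. Every setting here is an assumption chosen for illustration: a single standard normal covariate, a true intercept of -2 and slope of 1, a symmetric 10% chance that a response is misreported independently of the covariate, and 2000 cases and 2000 controls per sample. The helper names `fit_logistic` and `case_control_sample` are hypothetical.

```python
import numpy as np


def fit_logistic(x, y, n_iter=25):
    """Fit logistic regression with an intercept by Newton-Raphson.

    Returns the array (intercept, slope).
    """
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)              # score vector
        hess = X.T @ (X * w[:, None])     # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def case_control_sample(rng, labels, n_each):
    """Draw n_each subjects labelled 1 and n_each labelled 0, without replacement."""
    cases = rng.choice(np.flatnonzero(labels == 1), n_each, replace=False)
    controls = rng.choice(np.flatnonzero(labels == 0), n_each, replace=False)
    return np.concatenate([cases, controls])


rng = np.random.default_rng(0)

# Assumed population model: logit P(Y = 1 | x) = -2 + 1 * x.
N = 200_000
true_slope = 1.0
x = rng.normal(size=N)
p_true = 1.0 / (1.0 + np.exp(-(-2.0 + true_slope * x)))
y_true = rng.binomial(1, p_true)

# Assumed contamination: each response is misreported with probability 0.10.
flip = rng.random(N) < 0.10
y_reported = np.where(flip, 1 - y_true, y_true)

# Case-control sample selected and analysed on the true responses.
idx_clean = case_control_sample(rng, y_true, 2000)
slope_clean = fit_logistic(x[idx_clean], y_true[idx_clean])[1]

# Case-control sample selected and analysed on the reported responses.
idx_cont = case_control_sample(rng, y_reported, 2000)
slope_contaminated = fit_logistic(x[idx_cont], y_reported[idx_cont])[1]

print("slope from true responses:    ", slope_clean)
print("slope from reported responses:", slope_contaminated)
```

Under these assumptions, the slope fitted to the reported responses should fall well below the true value of 1, while the slope fitted to the true responses should stay close to it. Other misclassification mechanisms, such as error rates that depend on the covariate, may shift the estimate differently.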