Evaluating classification performance of biomarkers in two-phase case-control studies

Biomarkers are playing an increasingly important role in disease screening, early detection, and risk prediction. The two-phase case-control sampling design is widely used for the evaluation of candidate biomarkers. In the second phase, the sampling probabilities for cases and controls often depend on other covariates (sampling strata). If not properly accounted for, this biased sampling can lead to invalid inference on a biomarker's classification accuracy. In this paper, we adopt the idea of inverse probability weighting (IPW) and develop IPW-based estimators for various measures of a biomarker's classification performance, including points on the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and the partial AUC. In particular, we consider classification accuracy estimators that use sampling weights estimated conditionally on the sampling strata, and we further improve their efficiency by using estimated weights that additionally take into account auxiliary variables available from the phase-one cohort. We derive the asymptotic properties of the proposed estimators and provide analytical variance estimators for inference. Extensive simulation studies demonstrate excellent performance of the proposed weighted estimators, whereas the traditional empirical estimator can be severely biased. We also investigate the efficiency gained in estimating the various classification accuracy measures by using auxiliary variables in addition to sampling strata, and we apply the proposed methods to examples from a renal artery stenosis study and a prostate cancer study.
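To make the weighting idea concrete, the following is a minimal sketch of inverse-probability-weighted estimates of the AUC and of a point on the ROC curve, with weights estimated as the inverse of the empirical sampling fraction within each disease-by-stratum cell. It is an illustration under stated assumptions rather than the estimators as defined in the paper: the function names (`estimated_weights`, `ipw_auc`, `ipw_roc_point`) are hypothetical, ties are given half credit in the AUC, and the refinement that uses auxiliary phase-one variables to estimate the weights is not shown.

```python
import numpy as np

def estimated_weights(d, stratum, sampled):
    """Weight = (cell size in phase one) / (number sampled in phase two),
    computed within each (disease status, stratum) cell. Unsampled subjects get 0."""
    d, stratum, sampled = map(np.asarray, (d, stratum, sampled))
    w = np.zeros(len(d), dtype=float)
    for dv in np.unique(d):
        for sv in np.unique(stratum):
            cell = (d == dv) & (stratum == sv)
            n_selected = sampled[cell].sum()
            if n_selected > 0:
                w[cell & (sampled == 1)] = cell.sum() / n_selected
    return w

def ipw_auc(y, d, w):
    """Weighted proportion of case-control pairs in which the case has the
    larger marker value (ties count one half)."""
    y, d, w = map(np.asarray, (y, d, w))
    y_case, w_case = y[d == 1], w[d == 1]
    y_ctrl, w_ctrl = y[d == 0], w[d == 0]
    greater = (y_case[:, None] > y_ctrl[None, :]).astype(float)
    equal = (y_case[:, None] == y_ctrl[None, :]).astype(float)
    pair_weights = w_case[:, None] * w_ctrl[None, :]
    return (pair_weights * (greater + 0.5 * equal)).sum() / (w_case.sum() * w_ctrl.sum())

def ipw_roc_point(y, d, w, fpr):
    """Weighted true positive rate at the smallest observed control threshold
    whose weighted false positive rate does not exceed `fpr`."""
    y, d, w = map(np.asarray, (y, d, w))
    y_case, w_case = y[d == 1], w[d == 1]
    y_ctrl, w_ctrl = y[d == 0], w[d == 0]
    order = np.argsort(y_ctrl)
    y_sorted, w_sorted = y_ctrl[order], w_ctrl[order]
    # Weighted proportion of controls strictly above each sorted control value.
    survival = 1.0 - np.cumsum(w_sorted) / w_sorted.sum()
    k = np.argmax(survival <= fpr + 1e-12)  # first index meeting the FPR bound
    threshold = y_sorted[k]
    return (w_case * (y_case > threshold)).sum() / w_case.sum()

# Toy phase-two data: marker values, disease status, and sampling weights.
y = np.array([3.0, 1.0, 2.0, 0.0])
d = np.array([1, 1, 0, 0])
w = np.array([1.0, 2.0, 1.0, 3.0])
print(ipw_auc(y, d, w))                # weighted AUC, 10/12
print(ipw_auc(y, d, np.ones(4)))       # unweighted empirical AUC, 3/4
print(ipw_roc_point(y, d, w, 0.25))    # weighted TPR at FPR <= 0.25
```

In the toy data the weighted and unweighted AUCs differ (10/12 versus 3/4), which is the kind of discrepancy that arises when phase-two sampling depends on the strata and the weights are ignored.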
