Accounting for control mislabeling in case-control biomarker studies.

In biomarker discovery studies, uncertainty in case and control labels is often overlooked. When label uncertainty is ignored, model parameters and predicted risk can become biased, sometimes severely. The most common situation is a control set that contains an unknown number of undiagnosed, or future, cases. This has a marked impact where the model needs to be well calibrated, e.g., when the prediction performance of a biomarker panel is evaluated. Failing to account for class label uncertainty may lead to underestimation of classification performance and bias in parameter estimates, which can in turn affect meta-analyses that combine evidence from multiple studies. Using a simulation study, we outline how conventional statistical models can be modified to address class label uncertainty, leading to well-calibrated prediction performance estimates and reduced bias in meta-analysis. We focus on mislabeled control subjects in case-control studies, i.e., controls who are in fact undiagnosed cases, although the procedures we report are generic. Uncertainty in control status is particularly common in biomarker discovery studies in genomic and molecular epidemiology, where control subjects are commonly sampled from the general population with an established expected disease incidence rate.
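To make the idea of modifying a conventional model concrete, the following is a minimal sketch and not the procedure reported in the study. It assumes a single Gaussian biomarker, a known contamination fraction among controls, and perfectly labeled cases. Under those assumptions the probability of being labeled a case is `detect_prob * sigmoid(b0 + b1*x)`, where `detect_prob` is the share of true cases that carry a case label. All names here (`simulate`, `fit_logistic`, `detect_prob`, `contamination`) are hypothetical, chosen for illustration.

```python
import math
import random


def sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def simulate(n_cases, n_controls, contamination, rng):
    """Biomarker x ~ N(1, 1) in true cases and N(0, 1) in true non-cases.

    Each subject labeled as a control is an undiagnosed case with
    probability `contamination` (an assumed, known quantity).
    """
    xs, labels = [], []
    for _ in range(n_cases):
        xs.append(rng.gauss(1.0, 1.0))
        labels.append(1)
    for _ in range(n_controls):
        is_hidden_case = rng.random() < contamination
        xs.append(rng.gauss(1.0 if is_hidden_case else 0.0, 1.0))
        labels.append(0)
    return xs, labels


def fit_logistic(xs, labels, detect_prob=1.0, lr=0.5, n_iter=600):
    """Gradient ascent for P(labeled case | x) = detect_prob * sigmoid(b0 + b1*x).

    detect_prob = 1 reduces to ordinary logistic regression.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, labels):
            p = sigmoid(b0 + b1 * x)
            q = detect_prob * p  # probability of carrying a case label
            # Derivative of the log-likelihood with respect to b0 + b1*x.
            score = (1.0 - p) * (y - (1 - y) * q / (1.0 - q))
            g0 += score
            g1 += score * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


rng = random.Random(0)
n_cases, n_controls, contamination = 1000, 1000, 0.2
xs, labels = simulate(n_cases, n_controls, contamination, rng)

# Expected share of true cases that are labeled as cases.
detect_prob = n_cases / (n_cases + contamination * n_controls)

naive = fit_logistic(xs, labels)                            # ignores mislabeling
corrected = fit_logistic(xs, labels, detect_prob=detect_prob)

print("naive slope:    ", round(naive[1], 3))
print("corrected slope:", round(corrected[1], 3))
```

In this simulation the true log-odds slope is 1, because the two groups differ by one unit in mean and share unit variance. The naive fit is expected to be attenuated toward zero, while the corrected fit should recover a slope closer to 1. In practice the contamination fraction would have to come from external information, such as the expected incidence rate in the population from which controls were sampled.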
