Research issues and strategies for genomic and proteomic biomarker discovery and validation: a statistical perspective.

The development and validation of clinically useful biomarkers from high-dimensional genomic and proteomic information pose great research challenges. Present bottlenecks include the small fraction of biomarkers showing promise in initial discovery that are found to warrant subsequent validation, and the expense and time that biomarker validation requires. Biomarker evaluation should proceed in an orderly fashion to enhance rigor and efficiency. A molecular profiling approach, although promising, has a high chance of yielding biased results and overfitted models. Specimens from cohorts or intervention trials are essential to eliminate biases. The high cost of biomarker validation motivates some novel study design features, including sequential filtering and DNA pooling. For data analysis, logistic regression (in particular, boosted logistic regression) is robust against model misspecification and resistant to overfitting. Model assessment and cross-validation are critical components of data analysis, and an independent test set is a vital feature of study design.
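The point about biased results and cross-validation can be illustrated with a small simulation. The sketch below is not the authors' analysis: it assumes pure-noise data (no real marker exists), swaps in a nearest-centroid classifier in place of logistic regression for brevity, and uses hypothetical helper names (`make_noise_data`, `select_features`, `cv_error`). It compares cross-validation in which marker selection is repeated inside each training fold against cross-validation in which markers were picked once using all samples, held-out ones included.

```python
import random


def make_noise_data(n_samples=40, n_features=1000, seed=0):
    """Simulated pure-noise profile: features carry no information about y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced, arbitrary labels
    return X, y


def class_means(X, y, rows, feats):
    """Mean of each feature in `feats` per class, using only `rows`."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members)
                        for j in feats]
    return means


def select_features(X, y, rows, k):
    """Keep the k features with the largest gap between class means."""
    all_feats = range(len(X[0]))
    m = class_means(X, y, rows, all_feats)
    gap = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(all_feats, key=gap.__getitem__, reverse=True)[:k]


def centroid_predict(X, y, train, test, feats):
    """Assign each test sample to the nearer class centroid."""
    m = class_means(X, y, train, feats)
    preds = []
    for i in test:
        dist = {label: sum((X[i][j] - c) ** 2
                           for j, c in zip(feats, m[label]))
                for label in (0, 1)}
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds


def cv_error(X, y, k_feats=10, n_folds=5, select_inside=True):
    """Cross-validated error rate.

    select_inside=True  : features are re-selected from each training fold.
    select_inside=False : features are selected once from all samples, so
                          the held-out fold has already influenced the model.
    """
    n = len(X)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    global_feats = (None if select_inside
                    else select_features(X, y, range(n), k_feats))
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        feats = (select_features(X, y, train, k_feats)
                 if select_inside else global_feats)
        preds = centroid_predict(X, y, train, test, feats)
        errors += sum(p != y[i] for p, i in zip(preds, test))
    return errors / n


if __name__ == "__main__":
    X, y = make_noise_data()
    print("selection outside CV:", cv_error(X, y, select_inside=False))
    print("selection inside CV: ", cv_error(X, y, select_inside=True))
```

Because the simulated data contain no signal, an honest error estimate should sit near 50%. Selecting markers before cross-validation typically reports a much lower error, which is the kind of optimism that an independent test set is meant to expose.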