cancerclass: An R Package for Development and Validation of Diagnostic Tests from High-Dimensional Molecular Data

Progress in molecular high-throughput techniques has made comprehensive monitoring of biomolecules in medical samples possible. In the era of personalized medicine, these data form the basis for the development of diagnostic, prognostic and predictive tests for cancer. Because a high number of features is measured simultaneously in a relatively low number of samples, supervised learning approaches are prone to overfitting and to overestimation of performance. Bioinformatic methods have been developed to cope with these problems, including control of accuracy and precision. However, there is a demand for easy-to-use software that integrates methods for classifier construction, performance assessment and development of diagnostic tests. To help fill this gap, we developed a comprehensive R package for the development and validation of diagnostic tests from high-dimensional molecular data. An important focus of the package is careful validation of the classification results. To this end, we implemented an extended version of the multiple random validation protocol, a previously introduced validation method. The package includes methods for continuous prediction scores. This is important in a clinical setting, because scores can be converted to probabilities and help to distinguish between clear-cut and borderline classification results. The functionality of the package is illustrated by the analysis of two cancer microarray data sets.
