Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value

MOTIVATION: In the context of clinical bioinformatics, methods are needed to assess the additional predictive value of microarray data over simple clinical parameters alone. Such methods should also provide an optimal prediction rule that exploits the full potential of both types of data: ideally, they should detect subtypes that clinical parameters alone do not identify. Moreover, they should address the question of the additional predictive value of microarray data in a fair framework.

RESULTS: We propose a novel but simple two-step approach based on random forests and partial least squares (PLS) dimension reduction. It embeds the idea of pre-validation suggested by Tibshirani and colleagues, which relies on an internal cross-validation to avoid overfitting. Our approach is fast and flexible, and can be used both for assessing the overall additional significance of the microarray data and for building optimal hybrid classification rules. Its efficiency is demonstrated through simulations and an application to breast cancer and colorectal cancer data.

AVAILABILITY: Our method is implemented in the freely available R package 'MAclinical', which can be downloaded from http://www.stat.uni-muenchen.de/~socher/MAclinical
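To make the pre-validation idea concrete, the following is a minimal Python sketch, not the MAclinical implementation. It assumes one plausible reading of the two-step approach: a single PLS component is computed from the microarray data inside an internal cross-validation, so that each patient's score comes from a fit that never saw that patient, and the resulting pre-validated score is then placed next to the clinical variables as input for a second-stage classifier such as a random forest. All function names, the toy data, the choice of one component and the choice of five folds are illustrative assumptions; the random forest step itself is omitted.

```python
import random

def first_pls_direction(X, y):
    """Column means and unit-norm weight vector of the first PLS component.

    For a single response, the first PLS weight vector is proportional to
    the covariance between each centered column of X and the centered y.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    w = [sum((X[i][j] - col_means[j]) * (y[i] - y_mean) for i in range(n))
         for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5 or 1.0  # guard against a zero vector
    return col_means, [v / norm for v in w]

def prevalidated_scores(X, y, n_folds=5, seed=0):
    """Pre-validated PLS score: each sample is scored by a fit that excluded it."""
    n = len(X)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    scores = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit the PLS direction on the training folds only.
        means, w = first_pls_direction([X[i] for i in train],
                                       [y[i] for i in train])
        # Project the held-out samples using training means and weights.
        for i in held_out:
            scores[i] = sum((X[i][j] - means[j]) * w[j] for j in range(len(w)))
    return scores

def make_toy_data(n=40, p=50, seed=1):
    """Hypothetical data: 5 informative 'genes' out of p, one clinical variable."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(1.0 * y[i] if j < 5 else 0.0, 1.0) for j in range(p)]
         for i in range(n)]
    clinical = [[rng.gauss(0.5 * y[i], 1.0)] for i in range(n)]
    return X, clinical, y

X, clinical, y = make_toy_data()
z = prevalidated_scores(X, y)
# Step two would train a classifier (e.g. a random forest) on these rows.
combined = [clinical[i] + [z[i]] for i in range(len(y))]
```

Because the score for each patient is computed without that patient's outcome, it can be compared with the clinical variables on more equal terms than a score fitted on the full data, which is the sense in which pre-validation supports a fair assessment of additional predictive value.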
