Mistakes in validating the accuracy of a prediction classifier in high-dimensional but small-sample microarray data

A major goal of gene expression microarray studies is to develop an accurate classifier that can be adopted in clinical practice. Using a large number of genes with a small number of samples can lead to overfitting and produce promising but often nonreproducible results. Assessing the reproducibility of a classifier is therefore necessary. This article discusses appropriate methods for validating a developed classifier and estimating its prediction accuracy. It also reviews mistakes that can arise in the cross-validation process, drawing on articles published in prominent medical journals, so that inconclusive classifier-development results do not lead to inappropriate treatment.
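
The abstract does not list the specific cross-validation mistakes it reviews. One widely discussed mistake of this kind, taken here as an illustrative assumption, is selecting genes on the full data set before cross-validating, so that the held-out sample has already influenced the classifier. The sketch below uses pure-noise data, a simple mean-difference gene filter, and a nearest-centroid classifier; these choices and all function names are hypothetical and are not the article's method. On noise, an honest error estimate should sit near 50%, while selection outside the loop tends to report a much lower error.

```python
import random


def make_noise_data(n_samples=20, n_genes=1000, seed=0):
    """Simulate expression values with NO real class signal (assumed setup)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced, arbitrary labels
    return X, y


def class_means(X, y, genes):
    """Per-class mean expression for the given gene indices."""
    means = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    return means


def select_genes(X, y, k):
    """Keep the k genes with the largest absolute difference in class means."""
    all_genes = range(len(X[0]))
    m = class_means(X, y, all_genes)
    gap = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(all_genes, key=lambda g: gap[g], reverse=True)[:k]


def predict(X_train, y_train, x_new, genes):
    """Nearest-centroid prediction restricted to the selected genes."""
    m = class_means(X_train, y_train, genes)
    dist = {
        lab: sum((x_new[g] - c) ** 2 for g, c in zip(genes, m[lab]))
        for lab in (0, 1)
    }
    return min(dist, key=dist.get)


def loocv_error(X, y, k=10, select_inside=True):
    """Leave-one-out error; gene selection is done inside or outside the loop."""
    # Outside the loop: genes are chosen once using ALL samples (the mistake).
    fixed = None if select_inside else select_genes(X, y, k)
    errors = 0
    for i in range(len(X)):
        X_tr = X[:i] + X[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        # Inside the loop: genes are re-selected without the held-out sample.
        genes = select_genes(X_tr, y_tr, k) if select_inside else fixed
        errors += predict(X_tr, y_tr, X[i], genes) != y[i]
    return errors / len(X)


if __name__ == "__main__":
    X, y = make_noise_data()
    print("selection outside CV:", loocv_error(X, y, select_inside=False))
    print("selection inside CV: ", loocv_error(X, y, select_inside=True))
```

The only difference between the two estimates is where `select_genes` is called, which is why every data-dependent step of classifier development is usually repeated within each cross-validation fold.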
