Evaluating Microarray-based Classifiers: An Overview

For the last eight years, microarray-based class prediction has been the subject of numerous publications in medical, bioinformatics and statistics journals. In many articles, however, the assessment of classification accuracy relies on suboptimal procedures and receives little attention. In this paper, we carefully review various statistical aspects of classifier evaluation and validation from a practical point of view. The main topics addressed are accuracy measures, error rate estimation procedures, variable selection, choice of classifiers and validation strategy.
