Supervised Methods with Genomic Data: a Review and Cautionary View

We review well-accepted methods for addressing questions about differential gene expression and class prediction from gene expression data. We highlight some new topics that deserve more attention: testing differential expression of specific groups of genes, intra-group heterogeneity and class prediction, gene interaction in predictors, visualisation, difficulties in the biological interpretation of predictor genes and molecular signatures, and the use of ROC (receiver operating characteristic) curve-based statistics for evaluating predictors and differential expression. We end with a review of some serious problems that can limit the potential of these methods; we focus especially on inadequate assessment of the performance of new methods (due to inadequate estimation of error rates and to the use of few and “easy” data sets) and on the failure to recognise observational studies and to include needed covariates. We close with a comment on the need for freely available source code.
