A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments

MOTIVATION: A common task in analyzing microarray data is to determine which genes are differentially expressed across two kinds of tissue samples or across samples obtained under two experimental conditions. Several statistical methods have recently been proposed for this task when there are replicated samples under each condition, but it is not clear how these methods compare with each other. Our main goal here is to compare three methods: the t-test, a regression modeling approach (Thomas et al., Genome Res., 11, 1227-1236, 2001) and a mixture model approach (Pan et al., http://www.biostat.umn.edu/cgi-bin/rrs?print+2001, 2001a,b), with particular attention to their different modeling assumptions.

RESULTS: We point out that all three methods are based on the two-sample t-statistic or a minor variation of it, but they differ in how they assign a statistical significance level to that statistic, which can lead to large differences in the resulting significance levels and in the numbers of genes detected. In particular, we give an explicit formula for the test statistic used in the regression approach. We illustrate these points using the leukemia data of Golub et al. (Science, 285, 531-537, 1999). We also briefly compare the results with those of several other methods, including the empirical Bayesian method of Efron et al. (J. Am. Stat. Assoc., to appear, 2001) and the Significance Analysis of Microarrays (SAM) method of Tusher et al. (Proc. Natl Acad. Sci. USA, 98, 5116-5121, 2001).
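The central point, that one statistic can receive different significance levels depending on the reference distribution used, can be illustrated with a minimal sketch. It is not the procedure of any of the cited papers: the function names are hypothetical, the unequal-variance form of the two-sample t-statistic is assumed, a standard normal reference stands in for a parametric null, and a label-permutation null stands in for a data-driven one.

```python
import math
import random
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample t-statistic, unequal-variance form (assumed here)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def normal_two_sided_p(z):
    """Two-sided p-value under a standard normal reference (an assumption;
    a t reference with estimated degrees of freedom is the usual choice)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided p-value from randomly relabelling the pooled samples.
    Illustrative only; not the null used by the reviewed methods."""
    rng = random.Random(seed)
    observed = abs(welch_t(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = welch_t(pooled[:len(x)], pooled[len(x):])
        if abs(t) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical expression levels for one gene under two conditions.
    cond_a = [2.1, 2.5, 1.9, 2.8, 2.3]
    cond_b = [3.4, 3.9, 3.1, 3.6, 4.0]
    t_stat = welch_t(cond_a, cond_b)
    print("t-statistic:", t_stat)
    print("normal-reference p:", normal_two_sided_p(t_stat))
    print("permutation p:", permutation_p(cond_a, cond_b))
```

With few replicates the two p-values can differ noticeably even though the statistic is identical, which is the kind of discrepancy the comparison in this paper is concerned with.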

[1]  A. F. Smith,et al.  Statistical analysis of finite mixture distributions , 1986 .

[2]  Peter Adams,et al.  The EMMIX software for the fitting of mixtures of normal and t-components , 1999 .

[3]  Robert Tibshirani,et al.  Microarrays and Their Use in a Comparative Experiment , 2000 .

[4]  Wei Pan,et al.  A mixture model approach to detecting differentially expressed genes with microarray data , 2003, Functional & Integrative Genomics.

[5]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[6]  John D. Storey,et al.  Empirical Bayes Analysis of a Microarray Experiment , 2001 .

[7]  B. L. Welch The generalization of 'Student's' problem when several different population variances are involved. , 1947, Biometrika.

[9]  William H. Press,et al.  The Art of Scientific Computing Second Edition , 1998 .

[10]  Gary A. Churchill,et al.  Analysis of Variance for Gene Expression Microarray Data , 2000, J. Comput. Biol..

[11]  G. A. Whitmore,et al.  Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[12]  Terence P. Speed,et al.  Comparison of Methods for Image Analysis on cDNA Microarray Data , 2002 .

[13]  Brian S. Yandell,et al.  Mining for Low-abundance Transcripts in Microarray Data , 2001 .

[14]  J. Thomas,et al.  An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. , 2001, Genome research.

[15]  E. Lander Array of hope , 1999, Nature Genetics.

[16]  L. P. Zhao,et al.  Statistical modeling of large microarray data sets to identify stimulus-response profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[17]  William H. Press,et al.  Numerical Recipes in Fortran 77: The Art of Scientific Computing 2nd Edition - Volume 1 of Fortran Numerical Recipes , 1992 .

[18]  W. Pan,et al.  How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach , 2002, Genome Biology.

[19]  S. Dudoit,et al.  Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments , 2002 .

[20]  J. Devore,et al.  Statistics: The Exploration and Analysis of Data , 1986 .

[21]  Trey Ideker,et al.  Testing for Differentially-Expressed Genes by Maximum-Likelihood Analysis of Microarray Data , 2000, J. Comput. Biol..

[22]  Christina Kendziorski,et al.  On Differential Variability of Expression Ratios: Improving Statistical Inference about Gene Expression Changes from Microarray Data , 2001, J. Comput. Biol..

[23]  C. Li,et al.  Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[24]  Y. Chen,et al.  Ratio-based decisions and the quantitative analysis of cDNA microarray images. , 1997, Journal of biomedical optics.

[25]  R. Tibshirani,et al.  Significance analysis of microarrays applied to the ionizing radiation response , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[26]  H. Scheffé Practical Solutions of the Behrens-Fisher Problem , 1970 .

[27]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[28]  Peter McCullagh,et al.  [Regression Models for Discrete Longitudinal Responses]: Comment , 1993 .

[29]  Adrian E. Raftery,et al.  How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis , 1998, Comput. J..

[30]  John C. W. Rayner,et al.  Welch's approximate solution for the Behrens-Fisher problem , 1987 .