Gene expression

A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data

MOTIVATION: The false discovery rate (FDR) is defined as the expected proportion of false positives among all claimed positives. In practice the true FDR is unknown, so an estimated FDR can serve as a criterion for evaluating the performance of various statistical methods, provided that the estimate approximates the true FDR well or, at least, does not improperly favor or disfavor any particular method. Permutation methods have become popular for estimating FDR in genomic studies. The purpose of this paper is twofold. First, we investigate theoretically and empirically whether the standard permutation-based FDR estimator is biased and, if so, whether the bias inappropriately favors or disfavors any method. Second, we propose a simple modification of the standard permutation method that yields a better FDR estimator, which can in turn serve as a fairer criterion for evaluating various statistical methods.

RESULTS: Both simulated and real data examples are used for illustration and comparison. Three commonly used test statistics are considered: the sample mean, the SAM statistic and Student's t-statistic. The results show that the standard permutation method overestimates FDR. The overestimation is most severe for the sample mean statistic and least severe for the t-statistic, with the SAM statistic lying between the two extremes. This suggests that one must be cautious when using standard permutation-based FDR estimates to evaluate various statistical methods. In addition, our proposed FDR estimation method is simple and outperforms the standard method.
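To make the quantity under discussion concrete, the following is a minimal sketch of a standard permutation-based FDR estimate: the average number of genes exceeding a cutoff under permuted labels, divided by the number exceeding it in the observed data. It rests on assumptions not stated in this abstract: a two-group design with label permutation, a genes-by-samples matrix, and a fixed cutoff on the absolute statistic. The function names (`two_group_stat`, `permutation_fdr`) and the fudge constant `s0` are hypothetical choices for illustration. The sketch does not implement the modified estimator proposed in the paper, whose details are not given here.

```python
import numpy as np


def two_group_stat(x, labels, kind="t", s0=0.0):
    """Per-gene statistic for a genes-by-samples matrix x and 0/1 labels.

    kind="mean": difference of group means.
    kind="t":    pooled-variance t-statistic when s0 == 0; a SAM-like
                 statistic when s0 > 0 (s0 is an assumed fudge constant).
    """
    a, b = x[:, labels == 0], x[:, labels == 1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    if kind == "mean":
        return diff
    n1, n2 = a.shape[1], b.shape[1]
    pooled = ((n1 - 1) * a.var(axis=1, ddof=1)
              + (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return diff / (se + s0)


def permutation_fdr(x, labels, cutoff, kind="t", s0=0.0, n_perm=200, seed=0):
    """Standard permutation FDR estimate at a fixed cutoff (illustrative).

    Returns (estimated FDR, number of claimed positives).
    """
    rng = np.random.default_rng(seed)
    observed = np.abs(two_group_stat(x, labels, kind, s0))
    total_positives = int((observed > cutoff).sum())
    if total_positives == 0:
        return 0.0, 0
    null_counts = np.empty(n_perm)
    for b in range(n_perm):
        # Permuting labels for ALL genes, including truly changed ones,
        # is what the standard approach does.
        perm = rng.permutation(labels)
        null_stat = np.abs(two_group_stat(x, perm, kind, s0))
        null_counts[b] = (null_stat > cutoff).sum()
    fdr_hat = min(1.0, null_counts.mean() / total_positives)
    return fdr_hat, total_positives


if __name__ == "__main__":
    # Simulated data: 200 genes, 5 + 5 samples, first 20 genes shifted.
    rng = np.random.default_rng(1)
    labels = np.array([0] * 5 + [1] * 5)
    x = rng.normal(size=(200, 10))
    x[:20, labels == 1] += 3.0
    for kind, cutoff in (("mean", 1.5), ("t", 3.0)):
        print(kind, permutation_fdr(x, labels, cutoff, kind=kind))
```

Because the permuted data still contain the truly changed genes, the null counts in this sketch can be inflated, which is in line with the overestimation reported in the abstract; how strongly each statistic is affected is the paper's question, not something this example settles.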
