The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments.

As much of the focus of genetics and molecular biology has shifted toward the systems level, it has become increasingly important to accurately extract biologically relevant signal from thousands of related measurements. The common property among these high-dimensional biological studies is that the measured features have a rich and largely unknown underlying structure. One example of much recent interest is identifying differentially expressed genes in comparative microarray experiments. We propose a new approach aimed at optimally performing many hypothesis tests in a high-dimensional study. This approach estimates the optimal discovery procedure (ODP), which has recently been introduced and theoretically shown to optimally perform multiple significance tests. Whereas existing procedures essentially use data from only one feature at a time, the ODP approach uses the relevant information from the entire data set when testing each feature. In particular, we propose a generally applicable estimate of the ODP for identifying differentially expressed genes in microarray experiments. This microarray method consistently shows favorable performance over five widely used existing methods. For example, in testing for differential expression between two breast cancer tumor types, the ODP provides increases of 72% to 185% in the number of genes called significant at a false discovery rate of 3%. Our proposed microarray method is freely available to academic users in the open-source, point-and-click EDGE software package.
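To make the contrast with one-feature-at-a-time testing concrete, the following is a minimal sketch of an ODP-style statistic for a two-group comparison. It is not the EDGE implementation. It assumes normally distributed expression values, centers each gene by its overall mean, and plugs in per-gene estimates of the means and variances; the function names (`odp_like_statistics`, `_logsumexp`) and the choice to let every gene contribute to both the numerator and the denominator, without null weights, are illustrative simplifications. Each gene's data are scored against the fitted alternative densities of all genes and the fitted null densities of all genes, and the statistic is the log ratio of the two sums.

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    mx = a.max(axis=axis, keepdims=True)
    return (mx + np.log(np.exp(a - mx).sum(axis=axis, keepdims=True))).squeeze(axis)

def odp_like_statistics(data, labels):
    """Log of an ODP-style ratio for each gene (illustrative sketch).

    data   : array of shape (m genes, n samples)
    labels : length-n array of 0/1 group labels (at least 3 samples in total)
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    m, n = data.shape

    # Assumption: center each gene so the null mean is zero for every gene.
    x = data - data.mean(axis=1, keepdims=True)

    # Alternative fit: one mean per group, residual variance per gene.
    mean0 = x[:, labels == 0].mean(axis=1, keepdims=True)
    mean1 = x[:, labels == 1].mean(axis=1, keepdims=True)
    mu_alt = np.where(labels == 0, mean0, mean1)          # shape (m, n)
    var_alt = ((x - mu_alt) ** 2).sum(axis=1) / (n - 2)   # shape (m,)

    # Null fit: common mean (zero after centering), variance per gene.
    var_null = (x ** 2).sum(axis=1) / (n - 1)             # shape (m,)

    # Log-likelihood of gene j's data under gene i's fitted density.
    # Resulting arrays have shape (m, m): rows index j, columns index i.
    # Note: this builds an (m, m, n) array, so it only suits small m.
    resid_alt = x[:, None, :] - mu_alt[None, :, :]
    ll_alt = -0.5 * (n * np.log(2 * np.pi * var_alt)[None, :]
                     + (resid_alt ** 2).sum(axis=2) / var_alt[None, :])
    ss_null = (x ** 2).sum(axis=1)
    ll_null = -0.5 * (n * np.log(2 * np.pi * var_null)[None, :]
                      + ss_null[:, None] / var_null[None, :])

    # log( sum_i alt_i(x_j) ) - log( sum_i null_i(x_j) )
    return _logsumexp(ll_alt, axis=1) - _logsumexp(ll_null, axis=1)
```

Genes with larger values are stronger candidates for differential expression. Because every gene's fitted densities appear in both sums, a gene whose pattern resembles that of many other genes with clear group differences is ranked higher than it would be by a statistic computed from its own data alone. In practice the significance cutoff would still be chosen by a resampling-based null distribution and a false discovery rate estimate, which this sketch omits.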
