A statistical approach to selecting and confirming validation targets in -omics experiments

Background: Genomic technologies are, by their very nature, designed for hypothesis generation. In some cases, the hypotheses that are generated require that genome scientists confirm findings about specific genes or proteins. But one major advantage of high-throughput technology is that global genetic, genomic, transcriptomic, and proteomic behaviors can be observed. Manual confirmation of every statistically significant genomic result is prohibitively expensive. This has led researchers in genomics to adopt the strategy of confirming only a handful of the most statistically significant results, a small subset chosen for biological interest, or a small random subset. But there is no standard approach for selecting and quantitatively evaluating validation targets.

Results: Here we present a new statistical method and approach for statistically validating lists of significant results based on confirming only a small random sample. We apply our statistical method to show that the usual practice of confirming only the most statistically significant results does not statistically validate result lists. We analyze an extensively validated RNA-sequencing experiment to show that confirming a random subset can statistically validate entire lists of significant results. Finally, we analyze multiple publicly available microarray experiments to show that statistically validating random samples can both (i) provide evidence to confirm long gene lists and (ii) save thousands of dollars and hundreds of hours of labor over manual validation of each significant result.

Conclusions: For high-throughput -omics studies, statistical validation is a cost-effective and statistically valid approach to confirming lists of significant results.
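The idea of validating a whole list from a small random sample can be illustrated with a minimal sketch. The abstract does not specify the estimator, so everything below is an assumption made for illustration: the function names are hypothetical, the prior on the true-positive rate is taken to be uniform, and the summary reported is the posterior probability that the list's true-positive rate exceeds a chosen threshold. It should not be read as the authors' exact procedure.

```python
import math
import random


def sample_validation_targets(significant_ids, n_validate, seed=0):
    """Draw a simple random sample (without replacement) of results to confirm.

    Sampling at random, rather than taking the top hits, is what lets the
    confirmed fraction speak for the whole list.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(list(significant_ids), n_validate)


def prob_true_rate_exceeds(n_confirmed, n_tested, threshold):
    """Posterior P(true-positive rate > threshold), assuming a uniform prior.

    With a Beta(1, 1) prior and n_confirmed successes out of n_tested, the
    posterior is Beta(n_confirmed + 1, n_tested - n_confirmed + 1). For
    integer parameters its upper tail equals a binomial lower tail:
        P(rate > t) = P(Binomial(n_tested + 1, t) <= n_confirmed)
    """
    m = n_tested + 1
    return sum(
        math.comb(m, j) * threshold**j * (1 - threshold) ** (m - j)
        for j in range(n_confirmed + 1)
    )


if __name__ == "__main__":
    # Hypothetical list of 500 significant genes; confirm 20 at random.
    gene_list = [f"gene_{i}" for i in range(500)]
    targets = sample_validation_targets(gene_list, 20)
    print(targets[:5])

    # Suppose 19 of the 20 sampled genes are confirmed independently.
    # How plausible is it that at least 80% of the full list is real?
    print(prob_true_rate_exceeds(19, 20, 0.80))
```

Under these assumptions, the same calculation applied to the top 20 hits would not support a statement about the full list, because the most significant results are not representative of the rest.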
