Measuring reproducibility of high-throughput experiments

Reproducibility is essential to reliable scientific discovery in high-throughput experiments. In this work we propose a unified approach to measuring the reproducibility of findings from replicate experiments and to identifying putative discoveries on the basis of reproducibility. Unlike the usual scalar measures of reproducibility, our approach produces a curve that quantitatively assesses the point at which findings are no longer consistent across replicates. The curve is fitted by a copula mixture model, from which we derive a quantitative reproducibility score, the "irreproducible discovery rate" (IDR), analogous to the FDR. This score can be computed at each set of paired replicate ranks and permits the principled setting of thresholds both for assessing reproducibility and for combining replicates. Because our approach permits an arbitrary scale for each replicate, it provides useful descriptive measures in a wide variety of situations that remain to be explored. We study the performance of the algorithm using simulations and give a heuristic analysis of its theoretical properties. We demonstrate the effectiveness of our method in a ChIP-seq experiment.
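
To make the idea of a rank-based consistency curve concrete, the following is a minimal sketch, not the fitted copula mixture model or the IDR itself. It assumes one simple descriptive curve: for each fraction t, the share of items that fall in the top t fraction of both replicates. The function name `overlap_curve`, its parameters, and the choice of grid are hypothetical and introduced only for illustration. Because only ranks are used, each replicate may be scored on its own arbitrary scale.

```python
import numpy as np

def overlap_curve(scores_a, scores_b, grid=None):
    """Share of items ranked in the top-t fraction of BOTH replicates.

    Hypothetical helper for illustration: a simple rank-overlap curve,
    not the copula mixture fit described in the abstract.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("replicates must be 1-D arrays of equal length")
    n = a.size

    # Rank 0 is the most significant item (largest score) in each replicate.
    rank_a = np.argsort(np.argsort(-a, kind="stable"), kind="stable")
    rank_b = np.argsort(np.argsort(-b, kind="stable"), kind="stable")

    if grid is None:
        grid = np.linspace(0.05, 1.0, 20)  # assumed default grid of fractions
    grid = np.asarray(grid, dtype=float)

    # For each t, count items inside the top t*n of both ranked lists.
    psi = np.array([np.mean((rank_a < t * n) & (rank_b < t * n)) for t in grid])
    return grid, psi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.normal(size=1000)
    rep1 = signal + 0.3 * rng.normal(size=1000)        # replicate on one scale
    rep2 = 10 * signal + 3.0 * rng.normal(size=1000)   # replicate on another scale
    t, psi = overlap_curve(rep1, rep2)
    for ti, pi in zip(t[::5], psi[::5]):
        print(f"t={ti:.2f}  overlap={pi:.3f}")
```

Under this sketch, perfectly concordant rankings give a curve close to the line psi(t) = t, while unrelated rankings give roughly psi(t) = t^2, so the point where the curve bends away from the diagonal indicates where the replicates stop agreeing.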
