MEASURING REPRODUCIBILITY OF HIGH-THROUGHPUT

Reproducibility is essential to reliable scientific discovery in high-throughput experiments. In this work, we propose a unified approach to measure the reproducibility of findings identified from replicate experiments and to identify putative discoveries using reproducibility. Unlike the usual scalar measures of reproducibility, our approach produces a curve that quantitatively assesses when the findings are no longer consistent across replicates. We fit this curve with a copula mixture model, from which we derive a quantitative reproducibility score that we call the "irreproducible discovery rate" (IDR), analogous to the false discovery rate (FDR). This score can be computed at each set of paired replicate ranks and permits the principled setting of thresholds both for assessing reproducibility and for combining replicates. Because our approach permits an arbitrary scale for each replicate, it provides useful descriptive measures in a wide variety of situations yet to be explored. We study the performance of the algorithm using simulations and give a heuristic analysis of its theoretical properties. We demonstrate the effectiveness of our method in a ChIP-seq experiment.
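
The two ideas in the abstract, a rank-based consistency curve and an FDR-like score, can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it does not fit the copula mixture model. The function names `top_overlap_curve` and `idr_from_local` are hypothetical. The first function assumes a simple top-fraction overlap as the curve. The second assumes that per-item posterior probabilities of irreproducibility (called `local_idr` here) are already available from some fitted model, and averages them over the selected items by analogy with how an FDR relates to local false discovery rates.

```python
def top_overlap_curve(scores_a, scores_b, fractions):
    """For each fraction t, return the share of all items that fall in the
    top t fraction of BOTH replicates (higher score = stronger signal).
    Only ranks matter, so each replicate may use its own scale."""
    n = len(scores_a)
    if n != len(scores_b):
        raise ValueError("both replicates must score the same items")
    # Item indices ordered from strongest to weakest signal per replicate.
    order_a = sorted(range(n), key=lambda i: -scores_a[i])
    order_b = sorted(range(n), key=lambda i: -scores_b[i])
    curve = []
    for t in fractions:
        k = int(round(t * n))
        shared = set(order_a[:k]) & set(order_b[:k])
        curve.append(len(shared) / n)
    return curve


def idr_from_local(local_idr):
    """Assumed convention: the score attached to an item is the mean local
    value over all items whose local value is at most as large as its own."""
    order = sorted(range(len(local_idr)), key=lambda i: local_idr[i])
    result = [0.0] * len(local_idr)
    running_total = 0.0
    for rank, i in enumerate(order, start=1):
        running_total += local_idr[i]
        result[i] = running_total / rank
    return result


# Toy data: the three strongest items agree across replicates, the rest do not.
rep_a = [9, 8, 7, 1, 2, 3]
rep_b = [80, 90, 70, 30, 10, 20]  # deliberately on a different scale
curve = top_overlap_curve(rep_a, rep_b, [0.5, 1.0])
scores = idr_from_local([0.5, 0.01, 0.03])
```

Under perfectly reproducible rankings the overlap at fraction t equals t, so a curve that falls below that diagonal marks where the replicates stop agreeing. Thresholding the averaged score then selects items in the same way one would threshold an FDR.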
