General power and sample size calculations for high-dimensional genomic data

Abstract In the design of microarray or next-generation sequencing experiments it is crucial to choose the appropriate number of biological replicates. Often the number of differentially expressed genes and their effect sizes are small, so too few replicates will lead to insufficient power to detect them. On the other hand, too many replicates lead to unnecessarily high experimental costs. Power and sample size analysis can guide experimentalists in choosing the appropriate number of biological replicates. Several methods for power and sample size analysis have recently been proposed for microarray data. However, most of these are restricted to two-group comparisons and require user-defined effect sizes. Here we propose a pilot-data-based method for power and sample size analysis that can handle more general experimental designs and uses the pilot data to obtain estimates of the effect sizes. The method can also handle χ²-distributed test statistics, which enables power and sample size calculations for a much wider class of models, including the high-dimensional generalized linear models used, e.g., for RNA-seq data analysis. The performance of the method is evaluated using simulated and experimental data from several microarray and next-generation sequencing experiments. Furthermore, we compare our proposed method for estimating the density of effect sizes from pilot data with a recently proposed method specific to two-group comparisons.
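To make the trade-off between replicates and power concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a two-group comparison, a normal approximation to the test statistic, a fixed per-gene significance level in place of false discovery rate control, and a list of standardized effect sizes for the non-null genes that would in practice be estimated from pilot data. The function names `average_power` and `smallest_n` are hypothetical.

```python
import math
from statistics import NormalDist


def average_power(effect_sizes, n_per_group, alpha=0.001):
    """Average two-sided power over genes under a normal approximation.

    effect_sizes: standardized mean differences of the non-null genes
    (assumed to come from pilot data; illustrative only).
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    total = 0.0
    for d in effect_sizes:
        # Noncentrality of the two-group statistic with n replicates per group.
        ncp = d * math.sqrt(n_per_group / 2)
        # Probability of rejecting in either tail.
        total += (1 - z.cdf(crit - ncp)) + z.cdf(-crit - ncp)
    return total / len(effect_sizes)


def smallest_n(effect_sizes, target_power=0.8, alpha=0.001, n_max=1000):
    """Smallest number of replicates per group reaching the target average power."""
    for n in range(2, n_max + 1):
        if average_power(effect_sizes, n, alpha) >= target_power:
            return n
    return None  # target not reachable within n_max


if __name__ == "__main__":
    pilot_effects = [0.8, 1.0, 1.2, 1.5]  # made-up values for illustration
    for n in (5, 10, 20, 40):
        print(n, round(average_power(pilot_effects, n), 3))
    print("replicates needed:", smallest_n(pilot_effects))
```

Under these assumptions, average power grows monotonically with the number of replicates, so a simple linear search is enough to find the smallest adequate sample size.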
