Synthetic data sets for the identification of key ingredients for RNA-seq differential analysis

Numerous statistical pipelines are now available for the differential analysis of gene expression measured with RNA-sequencing technology. After normalization, most of them rely on similar statistical frameworks and differ primarily in the choice of data distribution, the mean and variance estimation strategy, and data filtering. Using synthetic data sets, we evaluate the impact of these choices when few biological replicates are available. The framework is built from real data sets and allows the exploration of scenarios that differ in the proportion of non-differentially expressed genes. It therefore provides an evaluation of the key ingredients of the differential analysis that is free of the biases associated with simulating data from parametric models. Our results show the importance of properly modeling the mean with linear or generalized linear models. Once the mean is properly modeled, the other parameters have a much smaller impact on the performance of the test. Finally, we propose the simple visualization of the raw P-value histogram as a practical criterion for evaluating the performance of differential analysis methods on real data sets.
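
The raw P-value histogram criterion can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the bin count, and the simulated P-values (a uniform null component plus a Beta-distributed component concentrated near zero) are assumptions made for illustration only. The usual expectation is that a well-behaved method yields a histogram that is roughly flat away from zero, with a possible peak near zero from differentially expressed genes.

```python
import random


def pvalue_histogram(pvalues, n_bins=20):
    """Count P-values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"P-value out of range: {p}")
        # A P-value of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


def print_histogram(counts, width=50):
    """Print a text histogram, one line per bin, scaled to the tallest bin."""
    n_bins = len(counts)
    peak = max(counts) or 1
    for i, c in enumerate(counts):
        bar = "#" * round(width * c / peak)
        print(f"[{i / n_bins:.2f}, {(i + 1) / n_bins:.2f}) {c:6d} {bar}")


# Hypothetical raw P-values (assumption, not data from the paper):
# 9000 non-differentially expressed genes with uniform P-values and
# 1000 differentially expressed genes with P-values concentrated near 0.
rng = random.Random(0)
null_p = [rng.random() for _ in range(9000)]
de_p = [rng.betavariate(0.3, 4.0) for _ in range(1000)]

counts = pvalue_histogram(null_p + de_p)
print_histogram(counts)
```

In practice the simulated values would be replaced by the raw (unadjusted) P-values returned by the differential analysis method under evaluation. A histogram that rises toward 1 or shows bumps in the middle suggests that the null distribution of the test statistic is poorly modeled.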
