Empirical assessment of the impact of sample number and read depth on RNA-Seq analysis workflow performance

Background: RNA-Sequencing analysis methods are rapidly evolving. Differential expression analysis is one common workflow, comprising read alignment, expression modeling, and differentially expressed gene identification, and the tool chosen for each step has a dramatic impact on performance characteristics. Although a number of workflows are emerging as high performers that are robust to diverse input types, their relative performance when either read depth or sample number is limited, a common occurrence in real-world practice, remains unexplored.

Results: Here, we evaluate the impact of varying read depth and sample number on the performance of differential gene expression identification workflows, as measured by precision (the fraction of genes called differentially expressed that are truly differentially expressed) and by recall (the fraction of truly differentially expressed genes that are identified). We focus our analysis on 30 high-performing workflows, systematically varying the read depth and number of biological replicates of patient monocyte samples provided as input. We find that, for most workflows, read depth has little effect on performance when held above two million reads per sample, and that performance declines below this threshold. The greatest impact of decreased sample number is seen below seven samples per group, where workflow performance becomes more heterogeneous. The choice of differential expression identification tool, in particular, has a large impact on the response to limited inputs.

Conclusions: Among the tested workflows, the recall/precision balance remains relatively stable across a range of read depths and sample numbers, although some workflows are more sensitive to input restriction. At ranges typically recommended for biological studies, performance is affected more by the number of biological replicates than by read depth. Caution should be used when selecting analysis workflows and interpreting results from low sample number experiments, as all workflows exhibit poorer performance at lower sample numbers near typically reported values, with variable impact on recall versus precision. These analyses highlight the performance characteristics of common differential gene expression workflows at varying read depths and sample numbers, and provide empirical guidance for experimental and analytical design.
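The two performance measures used above can be illustrated with a minimal sketch. It assumes a workflow's output is a set of gene identifiers called differentially expressed and that a reference set of truly differentially expressed genes is available. The function name `precision_recall` and the gene labels are hypothetical, and the abstract does not describe how the study's own reference set was constructed.

```python
def precision_recall(called, truth):
    """Return (precision, recall) for a set of called genes against a reference set.

    precision = true positives / all genes called differentially expressed
    recall    = true positives / all genes in the reference set
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Return NaN when a denominator is zero rather than raising.
    precision = true_positives / len(called) if called else float("nan")
    recall = true_positives / len(truth) if truth else float("nan")
    return precision, recall


# Hypothetical example data (not taken from the study).
reference_de = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
called_de = {"GENE_A", "GENE_B", "GENE_X"}

precision, recall = precision_recall(called_de, reference_de)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three called genes are in the reference set and two of the four reference genes are recovered, so a workflow can trade one measure against the other as its calls become more or less conservative.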
