Count-based differential expression analysis of RNA sequencing data using R and Bioconductor

RNA sequencing (RNA-seq) has been rapidly adopted for the profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations) while optionally adjusting for other systematic factors that affect the data-collection process. There are a number of subtle yet crucial aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, and there is a need for guidance on current best practices. This protocol presents a state-of-the-art computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software and, in particular, on two widely used tools, DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4–10 samples) can be <1 h, with computation time <1 d using a standard desktop PC.
