Principles of transcriptome analysis and gene expression quantification: an RNA‐seq tutorial

Genome‐wide analyses and high‐throughput screening were long reserved for biomedical applications and genetic model organisms. With the rapid development of massively parallel sequencing nanotechnology (or next‐generation sequencing) and the simultaneous maturation of bioinformatic tools, this situation has changed dramatically. Genome‐wide thinking is forging its way into disciplines like evolutionary biology or molecular ecology that were historically confined to small‐scale genetic approaches. Access to genome‐scale information is transforming these fields, as it allows us to address long‐standing questions that until recently were out of reach, such as the genetic basis of local adaptation and speciation or the evolution of gene expression profiles. Many in the eco‐evolutionary sciences will be working with large‐scale genomic data sets, and a basic understanding of the concepts and underlying methods is necessary to judge the work of others. Here, I briefly introduce next‐generation sequencing and then focus on transcriptome shotgun sequencing (RNA‐seq). This article gives a broad overview and provides practical guidance for the many steps involved in a typical RNA‐seq workflow, from sampling to RNA extraction, library preparation and data analysis. I focus on principles, present useful tools where appropriate and point out where caution is needed or progress is to be expected. This tutorial is mostly targeted at beginners, but also contains potentially useful reflections for the more experienced.
