Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data

Background: Understanding the regulation of gene expression, including transcription start site usage, alternative splicing, and polyadenylation, requires accurate quantification of expression levels down to the level of individual transcript isoforms. To comparatively evaluate the accuracy of the many methods that have been proposed for estimating transcript isoform abundance from RNA sequencing data, we have used both synthetic data and an independent experimental method for quantifying the abundance of transcript ends at the genome-wide level.

Results: We found that many tools have good accuracy and yield better estimates of gene-level expression than commonly used count-based approaches, but they vary widely in memory and runtime requirements. Nucleotide composition and intron/exon structure have comparatively little influence on the accuracy of expression estimates, which correlates most strongly with transcript/gene expression levels. To facilitate the reproduction and further extension of our study, we provide datasets, source code, and an online analysis tool on a companion website, where developers can upload expression estimates obtained with their own tool to compare them to those inferred by the methods assessed here.

Conclusions: As many methods for quantifying isoform abundance with comparable accuracy are available, a user’s choice will likely be determined by factors such as the memory and runtime requirements, as well as the availability of methods for downstream analyses. Sequencing-based methods to quantify the abundance of specific transcript regions could complement validation schemes based on synthetic data and quantitative PCR in future or ongoing assessments of RNA-seq analysis methods.
