Transcriptome assembly from long-read RNA-seq alignments with StringTie2

RNA sequencing using the latest single-molecule sequencing instruments produces reads that are thousands of nucleotides long. The ability to assemble these long reads can greatly improve the sensitivity of long-read analyses. Here we present StringTie2, a reference-guided transcriptome assembler that works with both short and long reads. StringTie2 includes new methods to handle the high error rate of long reads and offers the ability to work with full-length super-reads assembled from short reads, which further improves the quality of short-read assemblies. StringTie2 is more accurate and faster and uses less memory than all comparable short-read and long-read analysis tools.

[1]  Veli Mäkinen,et al.  Evaluating approaches to find exon chains based on long reads , 2016, bioRxiv.

[2]  S. Salzberg,et al.  StringTie enables improved reconstruction of a transcriptome from RNA-seq reads , 2015, Nature Biotechnology.

[3]  Mauricio O. Carneiro,et al.  The advantages of SMRT sequencing , 2013, Genome Biology.

[4]  Heng Li,et al.  Minimap2: pairwise alignment for nucleotide sequences , 2017, Bioinform..

[5]  Steven L Salzberg,et al.  HISAT: a fast spliced aligner with low memory requirements , 2015, Nature Methods.

[6]  Thomas R. Gingeras,et al.  STAR: ultrafast universal RNA-seq aligner , 2013, Bioinform..

[7]  Tanya Z. Berardini,et al.  The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools , 2011, Nucleic Acids Res..

[8]  Donald Sharon,et al.  Defining a personal, allele-specific, and single-molecule long-read transcriptome , 2014, Proceedings of the National Academy of Sciences.

[9]  Colin N. Dewey,et al.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome , 2011, BMC Bioinformatics.

[10]  Kiyoshi Asai,et al.  PBSIM: PacBio reads simulator - toward accurate genome assembly , 2013, Bioinform..

[11]  E. Moriyama,et al.  Next-Generation Transcriptome Assembly: Strategies and Performance Analaysis , 2018, Bioinformatics in the Era of Post Genomics and Big Data.

[12]  Carl Kingsford,et al.  Accurate assembly of transcripts through phase-preserving graph decomposition , 2017, Nature Biotechnology.

[13]  Wing Hung Wong,et al.  Characterization of the human ESC transcriptome by hybrid sequencing , 2013, Proceedings of the National Academy of Sciences.

[14]  Jeffrey T Leek,et al.  Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown , 2016, Nature Protocols.

[15]  Lennart Martens,et al.  1 SQANTI : extensive characterization of long read transcript sequences for quality control in 1 full-length transcriptome identification and quantification 2 3 , 2017 .

[16]  Thomas D. Wu,et al.  GMAP: a genomic mapping and alignment program for mRNA and EST sequence , 2005, Bioinform..

[17]  Nanopore native RNA sequencing of a human poly(A) transcriptome , 2019, Nature Methods.

[18]  Steven L Salzberg,et al.  Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype , 2019, Nature Biotechnology.

[19]  Faye D. Schilkey,et al.  A survey of the sorghum transcriptome using single-molecule long reads , 2016, Nature Communications.

[20]  Claus O. Wilke,et al.  Limitations of alignment-free tools in total RNA-seq quantification , 2018, BMC Genomics.

[21]  Rob Patro,et al.  Salmon provides fast and bias-aware quantification of transcript expression , 2017, Nature Methods.

[22]  Kresimir Krizanovic,et al.  Evaluation of tools for long read RNA-seq splice-aware alignment , 2017, bioRxiv.

[23]  S. Salzberg,et al.  CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise , 2018, Genome Biology.

[24]  Benedict Paten,et al.  Improved data analysis for the MinION nanopore sequencer , 2015, Nature Methods.

[25]  James A. Yorke,et al.  QuorUM: An Error Corrector for Illumina Reads , 2013, PloS one.

[26]  Helga Thorvaldsdóttir,et al.  Integrative Genomics Viewer , 2011, Nature Biotechnology.

[27]  Panos Kalnis,et al.  Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data , 2015, Bioinform..

[28]  Ali Mortazavi,et al.  TranscriptClean: variant-aware correction of indels, mismatches and splice junctions in long-read transcripts , 2018, Bioinform..

[29]  R. Guigó,et al.  Modelling and simulating generic RNA-Seq experiments with the flux simulator , 2012, Nucleic acids research.

[30]  Cole Trapnell,et al.  Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. , 2011, Genes & development.

[31]  Sergey Koren,et al.  Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii , a progenitor of bread wheat , with the mega-reads algorithm , 2016 .

[32]  Guojun Li,et al.  TransComb: genome-guided transcriptome assembly via combing junctions in splicing graphs , 2016, Genome Biology.

[33]  John W. S. Brown,et al.  Alternative Splicing Mediates Responses of the Arabidopsis Circadian Clock to Temperature Changes[W] , 2012, Plant Cell.

[34]  Angela N. Brooks,et al.  Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns , 2018, Nature Communications.

[35]  Michael Roberts,et al.  The MaSuRCA genome assembler , 2013, Bioinform..

[36]  Anders Krogh,et al.  Bayesian transcriptome assembly , 2014, Genome Biology.

[37]  Xiandong Meng,et al.  Widespread Polycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing , 2015, PloS one.

[38]  Pedro G. Ferreira,et al.  Transcriptome and genome sequencing uncovers functional variation in humans , 2013, Nature.

[39]  Robert Patro,et al.  Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms , 2013, ArXiv.

[40]  N. Friedman,et al.  Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data , 2011, Nature Biotechnology.

[41]  Eric T. Wang,et al.  Alternative Isoform Regulation in Human Tissue Transcriptomes , 2008, Nature.

[42]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[43]  Martin Vingron,et al.  Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels , 2012, Bioinform..

[44]  Lior Pachter,et al.  Near-optimal probabilistic RNA-seq quantification , 2016, Nature Biotechnology.

[45]  Christine G. Elsik,et al.  MaizeGDB 2018: the maize multi-genome genetics and genomics database , 2018, Nucleic Acids Res..

[46]  Rob Patro,et al.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms , 2013, Nature Biotechnology.

[47]  Tyson A. Clark,et al.  Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing , 2016, Nature Communications.