SPATA: A Seeding and Patching Algorithm for Hybrid Transcriptome Assembly

Transcriptome assembly from RNA-Seq reads is an active area of bioinformatics research. The ever-declining cost and the increasing depth of RNA-Seq have provided unprecedented opportunities to better identify expressed transcripts. However, the nonlinear transcript structures and the ultra-high throughput of RNA-Seq reads pose significant algorithmic and computational challenges to the existing transcriptome assembly approaches, either reference-guided or de novo. While reference-guided approaches offer good sensitivity, they rely on alignment results of the splice-aware aligners and are thus unsuitable for species with incomplete reference genomes. In contrast, de novo approaches do not depend on the reference genome but face a computational daunting task derived from the complexity of the graph built for the whole transcriptome. In response to these challenges, we present a hybrid approach to exploit an incomplete reference genome without relying on splice-aware aligners. We have designed a split-and-align procedure to efficiently localize the reads to individual genomic loci, which is followed by an accurate de novo assembly to assemble reads falling into each locus. Using extensive simulation data, we demonstrate a high accuracy and precision in transcriptome reconstruction by comparing to selected transcriptome assembly tools. Our method is implemented in assemblySAM, a GUI software freely available at this http URL

[1]  Tao Jiang,et al.  Inference of Isoforms from Short Sequence Reads , 2010, RECOMB.

[2]  Tao Jiang,et al.  Workshop: Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads , 2012, 2012 IEEE 2nd International Conference on Computational Advances in Bio and medical Sciences (ICCABS).

[3]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[4]  M. Baker De novo genome assembly: what every biologist should know , 2012, Nature Methods.

[5]  L. Feuk,et al.  Global and unbiased detection of splice junctions from RNA-seq data , 2010, Genome Biology.

[6]  N. Friedman,et al.  Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data , 2011, Nature Biotechnology.

[7]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .

[8]  de Ng Dick Bruijn A combinatorial problem , 1946 .

[9]  I. Good Normal Recurring Decimals , 1946 .

[10]  Zhong Wang,et al.  Next-generation transcriptome assembly , 2011, Nature Reviews Genetics.

[11]  J. Mullikin,et al.  SSAHA: a fast search method for large DNA databases. , 2001, Genome research.

[12]  LiWei,et al.  Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads , 2012 .

[13]  Inanç Birol,et al.  De novo transcriptome assembly with ABySS , 2009, Bioinform..

[14]  Martin Vingron,et al.  Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels , 2012, Bioinform..

[15]  Vineet Bafna,et al.  Sensitive gene fusion detection using ambiguously mapping RNA-Seq read pairs , 2011, Bioinform..

[16]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[17]  B. Frey,et al.  Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing , 2008, Nature Genetics.

[18]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[19]  Kevin Burrage,et al.  ISIS, the intron information system, reveals the high frequency of alternative splicing in the human genome , 2000, Nature Genetics.

[20]  Cole Trapnell,et al.  Computational methods for transcriptome annotation and quantification using RNA-seq , 2011, Nature Methods.

[21]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[22]  Yong-shu He,et al.  [Structural variation in the human genome]. , 2009, Yi chuan = Hereditas.

[23]  Fatih Ozsolak,et al.  RNA sequencing: advances, challenges and opportunities , 2011, Nature Reviews Genetics.

[24]  Tin Chi Nguyen,et al.  SASeq: A Selective and Adaptive Shrinkage Approach to Detect and Quantify Active Transcripts using RNA-Seq , 2012, ArXiv.

[25]  Derek Y. Chiang,et al.  MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery , 2010, Nucleic acids research.

[26]  Gilbert S Omenn,et al.  An integrative approach to reveal driver gene fusions from paired-end sequencing data in cancer , 2009, Nature Biotechnology.

[27]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature Biotechnology.

[28]  P. Pevzner,et al.  An Eulerian path approach to DNA fragment assembly , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[29]  Tao Jiang,et al.  IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly - (Extended Abstract) , 2011, RECOMB.

[30]  Y. Xing,et al.  Detection of splice junctions from paired-end RNA-seq data by SpliceMap , 2010, Nucleic acids research.

[31]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[32]  M. Schatz,et al.  Algorithms Gage: a Critical Evaluation of Genome Assemblies and Assembly Material Supplemental , 2008 .

[33]  J. Montoya-Burgos,et al.  Optimization of de novo transcriptome assembly from next-generation sequencing data. , 2010, Genome research.

[34]  Paul Medvedev,et al.  Computational methods for discovering structural variation with next-generation sequencing , 2009, Nature Methods.

[35]  G. Sherlock,et al.  Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads , 2010, BMC Genomics.

[36]  F. Denoeud,et al.  Annotating genomes with massive-scale RNA sequencing , 2008, Genome Biology.

[37]  C. Nusbaum,et al.  ALLPATHS: de novo assembly of whole-genome shotgun microreads. , 2008, Genome research.

[38]  Misko Dzamba,et al.  Detecting copy number variation with mated short reads. , 2010, Genome research.

[39]  Lee T. Sam,et al.  Transcriptome Sequencing to Detect Gene Fusions in Cancer , 2009, Nature.

[40]  Arend Hintze,et al.  Scaling metagenome sequence assembly with probabilistic de Bruijn graphs , 2011, Proceedings of the National Academy of Sciences.

[41]  F. Clark,et al.  Understanding alternative splicing: towards a cellular code , 2005, Nature Reviews Molecular Cell Biology.

[42]  Brendan J. Frey,et al.  Deciphering the splicing code , 2010, Nature.

[43]  Kwong-Sak Leung,et al.  ABMapper: a suffix array-based tool for multi-location searching and splice-junction mapping , 2010, Bioinform..

[44]  Steven J. M. Jones,et al.  De novo assembly and analysis of RNA-seq data , 2010, Nature Methods.

[45]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.