De novo assembly of transcriptome from next-generation sequencing data

Reconstructing a transcriptome by de novo assembly from next-generation sequencing (NGS) short reads provides an essential means to catalog expressed genes, identify splicing isoforms, and capture the expression details of transcripts for organisms with no available reference genome. De novo transcriptome assembly faces many unique challenges, including alternative splicing, expression levels that span a dynamic range of several orders of magnitude, and artifacts introduced by reverse transcription. In this review, we illustrate the overall strategy for applying the de Bruijn graph (DBG) approach to de novo transcriptome assembly. We further analyze parameters that have proven critical in DBG-based transcriptome assembly. Among them, k-mer length, read coverage depth, genome complexity, and the performance of different programs are addressed in greater detail. A multi-k-mer strategy that balances efficiency and sensitivity is discussed and highly recommended for de novo transcriptome assembly. Future directions point to combining NGS with third-generation sequencing technology, which would greatly enhance the power of de novo transcriptomics studies.
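
To make the role of k-mer length concrete, the following is a minimal sketch of how a de Bruijn graph can be built from short reads, assuming error-free reads and ignoring reverse complements and coverage. The function names `build_dbg` and `count_branches` and the toy reads are hypothetical and chosen only for illustration; they do not come from any assembler discussed in the review.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Build a toy de Bruijn graph.

    Nodes are (k-1)-mers; each k-mer found in a read adds one edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix.
    Returns a dict mapping each prefix node to its set of successors.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def count_branches(graph):
    """Count nodes with more than one successor (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


if __name__ == "__main__":
    # Hypothetical toy reads; the third differs from the first at one base,
    # loosely mimicking two isoforms or alleles.
    reads = ["ATGGCGTGCA", "GGCGTGCAAT", "ATGGCGAGCA"]
    for k in (3, 5, 7):
        g = build_dbg(reads, k)
        print(f"k={k}: nodes with outgoing edges={len(g)}, "
              f"branching nodes={count_branches(g)}")
```

In this simplified model, a shorter k tends to connect more reads but can create more branching nodes, while a longer k tends to resolve branches but requires longer overlaps between reads. A multi-k-mer strategy runs the assembly at several values of k and merges the results.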
