Comprehensive evaluation of de novo transcriptome assembly programs and their effects on differential gene expression analysis.

MOTIVATION With the decreased cost of RNA-Seq, an increasing number of non-model organisms have been sequenced. Due to the lack of reference genomes, de novo transcriptome assembly is required. However, there is limited systematic research evaluating the quality of de novo transcriptome assemblies and how the assembly quality influences downstream analysis. RESULTS We used two authentic RNA-Seq datasets from Arabidopsis thaliana, and produced transcriptome assemblies using eight programs with a series of k-mer sizes (from 25 to 71), including BinPacker, Bridger, IDBA-tran, Oases-Velvet, SOAPdenovo-Trans, SSP, Trans-ABySS and Trinity. We measured the assembly quality in terms of reference genome base and gene coverage, transcriptome assembly base coverage, number of chimeras and number of recovered full-length transcripts. SOAPdenovo-Trans performed best in base coverage, while Trans-ABySS performed best in gene coverage and number of recovered full-length transcripts. In terms of chimeric sequences, BinPacker and Oases-Velvet were the worst, while IDBA-tran, SOAPdenovo-Trans, Trans-ABySS and Trinity produced fewer chimeras across all single k-mer assemblies. In differential gene expression analysis, about 70% of the significantly differentially expressed genes (DEG) were the same using reference genome and de novo assemblies. We further identify four reasons for the differences in significant DEG between reference genome and de novo transcriptome assemblies: incomplete annotation, exon level differences, transcript fragmentation and incorrect gene annotation, which we suggest that de novo assembly is beneficial even when a reference genome is available. AVAILABILITY AND IMPLEMENTATION Software used in this study are publicly available at the authors' websites. CONTACT gribskov@purdue.eduSupplementary information: Supplementary data are available at Bioinformatics online.

[1]  Martin Vingron,et al.  Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels , 2012, Bioinform..

[2]  Steven J. M. Jones,et al.  De novo assembly and analysis of RNA-seq data , 2010, Nature Methods.

[3]  J. Harrow,et al.  Assessment of transcript reconstruction methods for RNA-seq , 2013, Nature Methods.

[4]  Richard Smith-Unna,et al.  TransRate: reference free quality assessment of de-novo transcriptome assemblies , 2015 .

[5]  Xin Li,et al.  The HEAT repeat protein ILITYHIA is required for plant immunity. , 2010, Plant & cell physiology.

[6]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[7]  Sang Yeol Lee,et al.  CYCLIN-DEPENDENT KINASE8 Differentially Regulates Plant Immunity to Fungal Pathogens through Kinase-Dependent and -Independent Functions in Arabidopsis[C][W] , 2014, Plant Cell.

[8]  Björn Usadel,et al.  Trimmomatic: a flexible trimmer for Illumina sequence data , 2014, Bioinform..

[9]  Helga Thorvaldsdóttir,et al.  Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration , 2012, Briefings Bioinform..

[10]  Xun Xu,et al.  SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads , 2013, Bioinform..

[11]  Cole Trapnell,et al.  How to map billions of short reads onto genomes , 2009, Nature Biotechnology.

[12]  N. Friedman,et al.  Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data , 2011, Nature Biotechnology.

[13]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .

[14]  Scott J Emrich,et al.  Assessing De Novo transcriptome assembly metrics for consistency and utility , 2013, BMC Genomics.

[15]  Jian Wang,et al.  SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler , 2012, GigaScience.

[16]  Stephen A. Smith,et al.  Optimizing de novo assembly of short-read RNA-seq data for phylogenomics , 2013, BMC Genomics.

[17]  Siu-Ming Yiu,et al.  IDBA-tran: a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels , 2013, Bioinform..

[18]  M. Gerstein,et al.  RNA-Seq: a revolutionary tool for transcriptomics , 2009, Nature Reviews Genetics.

[19]  Helga Thorvaldsdóttir,et al.  Integrative Genomics Viewer , 2011, Nature Biotechnology.

[20]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[21]  Nagarjun Vijay,et al.  Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA‐seq experiments , 2013, Molecular ecology.

[22]  E. Birney,et al.  Pfam: the protein families database , 2013, Nucleic Acids Res..

[23]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[24]  Changiz Eslahchi,et al.  SSP: an interval integer linear programming for de novo transcriptome assembly and isoform discovery of RNA-seq reads. , 2013, Genomics.

[25]  M. Schatz,et al.  Algorithms Gage: a Critical Evaluation of Genome Assemblies and Assembly Material Supplemental , 2008 .

[26]  Mark D. Robinson,et al.  Moderated statistical tests for assessing differences in tag abundance , 2007, Bioinform..