Combining Transcriptome Assemblies from Multiple De Novo Assemblers in the Allo-Tetraploid Plant Nicotiana benthamiana

Background: Nicotiana benthamiana is an allo-tetraploid plant, which makes de novo transcriptome assembly challenging because of homeologous and duplicated gene copies. Transcripts generated from such genes can be distinct yet highly similar in sequence, with markedly differing expression levels. This can lead to unassembled, partially assembled or mis-assembled contigs. Because de novo assemblers differ in their properties, no single assembler run with any single set of parameters can reconstruct all possible transcripts from a transcriptome.

Results: In an effort to maximise the diversity and completeness of de novo assembled transcripts, we utilised four de novo transcriptome assemblers, TransAbyss, Trinity, SOAPdenovo-Trans and Oases, using a range of k-mer sizes and different input RNA-seq read counts. We complemented the parameter space biologically by using RNA from 10 plant tissues. We then combined the output of all assemblies into a large super-set of sequences. Using a method from the EvidentialGene pipeline, the combined assembly was reduced from 9.9 million de novo assembled transcripts to about 235,000, of which about 50,000 were classified as primary. Metrics such as average bit-scores, feature response curves and the ability to distinguish paralogous or homeologous transcripts indicated that the EvidentialGene-processed assembly was of high quality. Of 35 RNA silencing gene transcripts, 34 were identified as assembled to full length, whereas in a previous assembly using only one assembler, 9 of these were partially assembled.

Conclusions: To achieve a high-quality transcriptome, it is advantageous to implement and combine the output from as many different de novo assemblers as possible. In essence, we have taken the ‘best’ output from each assembler while minimising sequence redundancy. We have also shown that simultaneous assessment of a variety of metrics, not just those focused on contig length, is necessary to gauge the quality of assemblies.
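The pooling step described in the Results can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the method used in the study: the run labels (such as "trinity_k25"), the function names and the exact-duplicate filter are all hypothetical, and the filter stands in only for the simplest kind of redundancy removal. It does not reproduce the EvidentialGene classification of transcripts into primary and other classes.

```python
from typing import Dict, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first token of the header as the identifier.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def pool_assemblies(assemblies: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pool contigs from several assembly runs into one super-set.

    `assemblies` maps a hypothetical run label (assembler plus k-mer size)
    to the FASTA text produced by that run. Contig identifiers are prefixed
    with the run label so they stay unique, and contigs whose sequence is
    identical to one already pooled are dropped. This is only an
    exact-duplicate filter, not the EvidentialGene reduction.
    """
    seen = set()
    pooled = []
    for label, fasta_text in assemblies.items():
        for contig_id, seq in parse_fasta(fasta_text):
            if seq in seen:
                continue
            seen.add(seq)
            pooled.append((f"{label}|{contig_id}", seq))
    return pooled


if __name__ == "__main__":
    runs = {
        "trinity_k25": ">c1 len=8\nACGTACGT\n>c2\nTTTT\n",
        "oases_k31": ">n1\nacgt\nacgt\n>n2\nGGGG\n",
    }
    for contig_id, seq in pool_assemblies(runs):
        print(f">{contig_id}\n{seq}")
```

Prefixing each identifier with its run label keeps track of which assembler and k-mer size produced a contig, which is useful when later asking which runs contributed the retained transcripts.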
