Comparing de novo assemblers for 454 transcriptome data

Background

Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, it is important to correctly assemble these to estimate the sequence of all the transcripts. Most transcriptome assembly projects use only one program for assembling 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis.

Results

Although no single assembler performed best on all our criteria, Newbler 2.5 gave longer contigs, better alignments to some reference sequences, and was fast and easy to use. SeqMan assemblies performed best on the criterion of recapitulating known transcripts, and had more novel sequence than the other assemblers, but generated an excess of small, redundant contigs. The remaining assemblers all performed almost as well, with the exception of Newbler 2.3 (the version currently used by most assembly projects), which generated assemblies that had significantly lower total length. As different assemblers use different underlying algorithms to generate contigs, we also explored merging of assemblies and found that the merged datasets not only aligned better to reference sequences than individual assemblies, but were also more consistent in the number and size of contigs.

Conclusions

Transcriptome assemblies are smaller than genome assemblies and thus should be more computationally tractable, but are often harder because individual contigs can have highly variable read coverage. Comparing single assemblers, Newbler 2.5 performed best on our trial data set, but other assemblers were closely comparable. Combining differently optimal assemblies from different programs, however, gave a more credible final product, and this strategy is recommended.
