Assessing De Novo transcriptome assembly metrics for consistency and utility

Background: Transcriptome sequencing and assembly are a valuable resource for the study of non-model species, and many metrics have been used to evaluate and compare these assemblies. Unfortunately, it is still unclear which of these metrics accurately reflect assembly quality.

Results: We simulated sequencing of Drosophila melanogaster transcripts. By assembling these simulated reads using both a "perfect" and a modern transcriptome assembler while varying read length and sequencing depth, we evaluated quality metrics to determine whether they 1) revealed perfect assemblies to be of higher quality, and 2) revealed perfect assemblies to be more complete as data quantity increased. Several commonly used metrics were not consistent with these expectations, including average contig coverage and length, though they became consistent when singletons were included in the analysis. We found several annotation-based metrics to be consistent and informative, including contig reciprocal best hit count and contig unique annotation count. Finally, we evaluated a number of novel metrics such as reverse annotation count, contig collapse factor, and the ortholog hit ratio, discovering that each assesses assembly quality in a unique way.

Conclusions: Although much attention has been given to transcriptome assembly, little research has focused on determining how best to evaluate assemblies, particularly in light of the variety of options available for read length and sequencing depth. Our results provide an important review of these metrics and give researchers tools to produce the highest quality transcriptome assemblies.
