Optimization of de novo transcriptome assembly from next-generation sequencing data.

Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task that requires algorithmic improvements. We present two methods that substantially improve de novo transcriptome assembly. The first addresses the observation that the single k-mer length used by current de novo assemblers is suboptimal for transcriptomes in which sequence coverage varies widely among transcripts. Our Multiple-k method instead uses several k-mer lengths for de novo transcriptome assembly. We demonstrate its performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti and using the existing genome to check the accuracy of the assembly. The second method uses a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method, which maps contigs against the closest available reference proteome and scaffolds those that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly while introducing few errors. We applied both methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. With the Multiple-k and STM methods, the assembly gains in contiguity and in gene identification, showing that our methods clearly improve assembly quality and can be widely used. The new methods successfully assembled the transcripts of the core set of genes regulating tooth development in vertebrates, whereas classic de novo assembly failed.
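The scaffolding idea behind STM can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Hit` record, the `scaffold_by_protein` function, and the gap-sizing rule (three `N` per unaligned residue, at least one `N`) are hypothetical choices made for illustration. The sketch also assumes that each contig has a single best protein hit, that all contigs are already in the coding orientation, and that hit coordinates are 1-based amino-acid positions on the reference protein.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record: best translated-alignment hit of one contig."""
    contig_id: str
    protein_id: str
    prot_start: int  # 1-based first aligned residue on the reference protein
    prot_end: int    # 1-based last aligned residue on the reference protein


def scaffold_by_protein(contigs: Dict[str, str], hits: List[Hit]) -> Dict[str, str]:
    """Join contigs that map onto the same protein, ordered along that protein."""
    by_protein = defaultdict(list)
    for hit in hits:
        by_protein[hit.protein_id].append(hit)

    scaffolds = {}
    for protein_id, group in by_protein.items():
        # Order contigs by where they align on the reference protein.
        group.sort(key=lambda h: h.prot_start)
        pieces = [contigs[group[0].contig_id]]
        for prev, cur in zip(group, group[1:]):
            # Residues of the protein not covered by either neighbouring hit.
            gap_aa = max(cur.prot_start - prev.prot_end - 1, 0)
            # Assumed convention: 3 N per missing residue, never fewer than 1 N.
            pieces.append("N" * max(3 * gap_aa, 1))
            pieces.append(contigs[cur.contig_id])
        scaffolds[protein_id] = "".join(pieces)
    return scaffolds


if __name__ == "__main__":
    contigs = {"c1": "ATGGCC", "c2": "GGATAA", "c3": "TTT"}
    hits = [
        Hit("c2", "P1", 5, 6),
        Hit("c1", "P1", 1, 2),
        Hit("c3", "P2", 1, 1),
    ]
    for protein, seq in scaffold_by_protein(contigs, hits).items():
        print(protein, seq)
```

In practice the hits would come from a translated search of the contigs against the reference proteome, and a real implementation would also need to handle strand, overlapping hits, and contigs that match several proteins.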
