Comparison of next generation sequencing technologies for transcriptome characterization

Background

We have developed a simulation approach to help determine the optimal mixture of sequencing methods for the most complete and cost-effective transcriptome sequencing. We compared simulation results for traditional capillary sequencing with "Next Generation" (NG) ultra high-throughput technologies. The simulation model was parameterized using mappings of 130,000 cDNA sequence reads to the Arabidopsis genome (NCBI Accession SRA008180.19). We also generated 454-GS20 sequences and de novo assemblies for the basal eudicot California poppy (Eschscholzia californica) and the magnoliid avocado (Persea americana) using a variety of methods for cDNA synthesis.

Results

The Arabidopsis reads tagged more than 15,000 genes, including new splice variants and extended UTRs. Of the total 134,791 reads (13.8 MB), 119,518 (88.7%) mapped exactly to known exons, while 1,117 (0.8%) mapped to introns, 11,524 (8.6%) spanned annotated intron/exon boundaries, and 3,066 (2.3%) extended beyond the end of annotated UTRs. Sequence-based inference of relative gene expression levels correlated significantly with microarray data. As expected, NG sequencing of normalized libraries tagged more genes than sequencing of non-normalized libraries, although non-normalized libraries yielded more full-length cDNA sequences. The Arabidopsis data were used to simulate additional rounds of NG and traditional EST sequencing, and various combinations of each. Our simulations suggest that a combination of FLX and Solexa sequencing gives optimal transcriptome coverage at modest cost. We have also developed ESTcalc (http://fgp.huck.psu.edu/NG_Sims/ngsim.pl), an online web tool that allows users to explore the results of this study by specifying individualized costs and sequencing characteristics.

Conclusion

NG sequencing technologies are a highly flexible set of platforms that can be scaled to suit different project goals. In terms of sequence coverage alone, NG sequencing is a dramatic advance over capillary-based sequencing, but it also presents significant challenges in assembly and sequence accuracy due to short read lengths, method-specific sequencing errors, and the absence of physical clones. These problems may be overcome by hybrid sequencing strategies that use a mixture of sequencing methodologies, by new assemblers, and by sequencing more deeply. Sequencing and microarray outcomes from multiple experiments suggest that our simulator will be useful for guiding NG transcriptome sequencing projects in a wide range of organisms.
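
The effect of library normalization on the number of genes tagged can be illustrated with a toy read-sampling simulation. The Python sketch below is not the simulation model used in this study and is not the ESTcalc code. It assumes a hypothetical power-law distribution of transcript abundances, treats each read as an independent draw from that distribution, and models normalization as a simple flattening of the abundances. All function names and parameter values are illustrative assumptions.

```python
import random


def power_law_abundances(n_genes, exponent=1.0):
    """Hypothetical transcript abundances: gene at rank r has weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_genes + 1)]


def normalize_library(abundances, strength=0.5):
    """Toy model of cDNA normalization (an assumption, not a lab protocol model).

    Each abundance is raised to the power (1 - strength), which compresses
    the gap between highly and weakly expressed genes. strength=0 leaves the
    library unchanged; strength=1 makes every gene equally abundant.
    """
    return [a ** (1.0 - strength) for a in abundances]


def genes_tagged(abundances, n_reads, seed=0):
    """Sample n_reads reads in proportion to abundance; count distinct genes hit."""
    rng = random.Random(seed)
    gene_ids = range(len(abundances))
    reads = rng.choices(gene_ids, weights=abundances, k=n_reads)
    return len(set(reads))


def compare_libraries(n_genes=2000, n_reads=5000, strength=0.5, seed=0):
    """Return (genes tagged without normalization, genes tagged with normalization)."""
    raw = power_law_abundances(n_genes)
    flattened = normalize_library(raw, strength)
    return (genes_tagged(raw, n_reads, seed),
            genes_tagged(flattened, n_reads, seed))
```

Calling `compare_libraries()` returns the two gene counts for the same sequencing depth. Under these assumptions the normalized library tags more genes, in line with the trend reported in the Results. The sketch says nothing about read length, assembly, or cost, which a realistic simulator would also need to model.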
