Benchmarking next-generation transcriptome sequencing for functional and evolutionary genomics.

Next-generation sequencing has opened the door to genomic analysis of nonmodel organisms. Technologies generating long-sequence reads (200-400 bp) are increasingly used in evolutionary studies of nonmodel organisms, but the short-sequence reads (30-50 bp) that can be produced at lower cost are thought to be of limited utility for de novo sequencing applications. Here, we tested this assumption by short-read sequencing the transcriptomes of the tropical disease vectors Aedes aegypti and Anopheles gambiae, for which complete genome sequences are available. Comparison of our results to the reference genomes allowed us to accurately evaluate the quantity, quality, and functional and evolutionary information content of our "test" data. We produced more than 0.7 billion nucleotides of sequence data per species, which assembled into more than 21,000 test contigs larger than 100 bp per species and covered approximately 27% of the Aedes reference transcriptome. Remarkably, the substitution error rate in the test contigs was approximately 0.25% per site, with very few indels or assembly errors. Test contigs of both species were enriched for genes involved in energy production and protein synthesis and depleted of genes involved in transcription and differentiation. Ortholog prediction using the test contigs was accurate across hundreds of millions of years of evolution. Our results demonstrate the considerable utility of short-read transcriptome sequencing for genomic studies of nonmodel organisms and suggest an approach for assessing the information content of next-generation data for evolutionary studies.
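
The two headline quality measures above, a per-site substitution error rate and the fraction of the reference transcriptome covered, can be illustrated with a minimal sketch. It assumes the test contigs have already been aligned to the reference and that the alignments are available as gapped string pairs and as half-open coordinate intervals. The function names and input formats are hypothetical and chosen for illustration only; this is not the pipeline used in the study.

```python
from typing import Iterable, List, Tuple


def substitution_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of aligned, ungapped sites where contig and reference differ.

    Each pair is (contig, reference) as equal-length aligned strings,
    with '-' marking a gap (hypothetical input format).
    """
    mismatches = 0
    sites = 0
    for contig, reference in pairs:
        if len(contig) != len(reference):
            raise ValueError("aligned sequences must have equal length")
        for c, r in zip(contig.upper(), reference.upper()):
            if c == "-" or r == "-":
                continue  # gap columns are indels, not substitutions
            sites += 1
            if c != r:
                mismatches += 1
    return mismatches / sites if sites else 0.0


def covered_fraction(intervals: List[Tuple[int, int]], reference_length: int) -> float:
    """Fraction of reference bases covered by at least one interval.

    Intervals are half-open [start, end) positions on a single reference
    sequence; overlapping intervals are counted only once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)   # skip bases already counted
        end = min(end, reference_length)  # clip to the reference
        if end > start:
            covered += end - start
            current_end = end
    return covered / reference_length if reference_length else 0.0


# Toy data, not values from the study.
toy_alignments = [("ACGTACGT", "ACGTACGA"), ("AC-GT", "ACTGT")]
toy_intervals = [(0, 10), (5, 20), (50, 60)]
error_rate = substitution_error_rate(toy_alignments)
coverage = covered_fraction(toy_intervals, 100)
```

Under these assumptions, a reported rate of about 0.25% per site corresponds to roughly one mismatch per 400 aligned bases. Gap columns are excluded from the denominator so that indels are tallied separately from substitutions, matching the distinction drawn in the text.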
