Next generation transcriptomes for next generation genomes using est2assembly

Background: The decreasing costs of capillary-based Sanger sequencing and next generation technologies, such as 454 pyrosequencing, have prompted an explosion of transcriptome projects in non-model species, where even shallow sequencing of transcriptomes can now be used to examine a range of research questions. This rapid growth in data has outstripped the ability of researchers working on non-model species to analyze and mine transcriptome data efficiently.

Results: Here we present a semi-automated platform, 'est2assembly', that processes raw sequence data from Sanger or 454 sequencing into a hybrid de novo assembly, annotates it and produces GMOD-compatible output, including a SeqFeature database suitable for GBrowse. Users are able to parameterize assembler variables, judge assembly quality and determine the optimal assembly for their specific needs. We used est2assembly to process Drosophila and Bicyclus public Sanger EST data and then compared them to published 454 data as well as eight new insect transcriptome collections.

Conclusions: Analysis of such a wide variety of data allows us to understand how these new technologies can assist EST project design. We determine that assembler parameterization is as essential as standardized methods to judge the output of EST projects. Further, even shallow sequencing using 454 produces sufficient data to be of wide use to the community. est2assembly is an important tool to assist manual curation of gene models, which are an important resource in their own right, especially for species that are due to acquire a genome project using Next Generation Sequencing.
