ConiferEST: an integrated bioinformatics system for data reprocessing and mining of conifer expressed sequence tags (ESTs)

Background
With the advent of low-cost, high-throughput sequencing, the amount of public domain Expressed Sequence Tag (EST) sequence data available for both model and non-model organisms is growing exponentially. While these data are widely used for characterizing various genomes, they also present a serious challenge for data quality control and validation due to their inherent deficiencies, particularly for species without genome sequences.

Description
ConiferEST is an integrated system for data reprocessing, visualization and mining of conifer ESTs. In its current release, Build 1.0, it houses 172,229 loblolly pine EST sequence reads, which were obtained by reprocessing raw DNA sequencer traces with our software, WebTraceMiner. The trace files were downloaded from the NCBI Trace Archive. ConiferEST provides biologists with unique, easy-to-use data visualization and mining tools for a variety of putative sequence features, including cloning vector segments, adapter sequences, restriction endonuclease recognition sites, polyA and polyT runs, and their corresponding Phred quality values. Based on these putative features, verified sequence features such as 3' and/or 5' termini of cDNA inserts in either the sense or non-sense strand have been identified in silico. Interestingly, only 30.03% of the designated 3' ESTs were found to have an authenticated 5' terminus in the non-sense strand (i.e., polyT tails), while fewer than 5.34% of the designated 5' ESTs had a verified 5' terminus in the sense strand. Such previously ignored features provide valuable insight for data quality control and validation of error-prone ESTs, as well as the ability to identify novel functional motifs embedded in large EST datasets. We found that "double-termini adapters" were effective indicators of potential EST chimeras. For all sequences with one or more termini verified in silico, we used InterProScan to assign protein domain signatures, the results of which are available for in-depth exploration through our biologist-friendly web interfaces.

Conclusion
ConiferEST represents a unique and complementary public resource for EST data integration and mining in conifers: it reprocesses raw DNA traces, identifies putative sequence features, and determines and annotates features verified in silico. Seamlessly integrated with other public resources, ConiferEST provides biologists with powerful tools to verify data, visualize abnormalities (including EST chimeras), and explore large EST datasets.
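To make the idea of a putative feature paired with its Phred quality values concrete, the following is a minimal sketch of how polyA and polyT runs could be located in a base-called read and annotated with their mean quality. It is an illustration only and not the WebTraceMiner or ConiferEST implementation: the function names, the `min_len` threshold of 10 bases, and the example read are all assumptions introduced here.

```python
import re


def find_homopolymer_runs(seq, quals, base, min_len=10):
    """Return (start, end, mean_phred) for each run of `base` of at least min_len bases.

    Hypothetical helper: `min_len` is an assumed threshold, not a value
    taken from the ConiferEST paper.
    """
    pattern = re.compile(f"{base}{{{min_len},}}", re.IGNORECASE)
    runs = []
    for match in pattern.finditer(seq):
        window = quals[match.start():match.end()]
        runs.append((match.start(), match.end(), sum(window) / len(window)))
    return runs


def classify_read(seq, quals, min_len=10):
    """Collect putative polyA and polyT runs for one read (illustrative only)."""
    return {
        "polyA": find_homopolymer_runs(seq, quals, "A", min_len),
        "polyT": find_homopolymer_runs(seq, quals, "T", min_len),
    }


# Made-up read: a short prefix, a 12-base polyT run, then a short suffix.
read = "GATTACA" + "T" * 12 + "GCGCATCG"
phred = [30] * len(read)  # uniform placeholder quality values
features = classify_read(read, phred)
print(features)
```

A run reported this way is only a candidate. Deciding whether a polyT run near the start of a read marks a genuine cDNA terminus would also require the surrounding vector and adapter context, which this sketch does not model.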
