WebTraceMiner: a web service for processing and mining EST sequence trace files

Expressed sequence tags (ESTs) remain a dominant approach for characterizing the protein-encoding portions of various genomes. Owing to their inherent deficiencies, they also present serious challenges for data quality control. Before GenBank submission, EST sequences are typically screened and trimmed of vector and adapter/linker sequences, as well as polyA/T tails. Removing these sequences hinders data validation of error-prone ESTs and impedes data mining of certain functional motifs, whose detection relies on accurate positional annotation of polyA tails added post-transcriptionally. As raw DNA sequence information becomes increasingly available from public repositories, such as the NCBI Trace Archive, new tools will be necessary to reanalyze and mine these data for new information. WebTraceMiner (www.conifergdb.org/software/wtm) was designed as a public sequence processing service for raw EST traces, with a focus on detection and mining of sequence features that help characterize the 3′ and 5′ termini of cDNA inserts, including vector fragments, adapter/linker sequences, insert-flanking restriction endonuclease recognition sites and polyA or polyT tails. WebTraceMiner complements other public EST resources and should prove to be a unique tool to facilitate data validation and mining of error-prone ESTs (e.g. discovery of new functional motifs).
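
As a rough illustration of the kind of positional feature annotation described above, the following Python sketch locates polyA/polyT runs and exact matches to a restriction endonuclease recognition site in an untrimmed read. It is not WebTraceMiner's algorithm: the function name `find_tail_features`, the `min_run` threshold of 10 bases, and the default EcoRI site are assumptions chosen for illustration, and the sketch tolerates no base-calling errors within a run.

```python
import re

def find_tail_features(seq, min_run=10, site="GAATTC"):
    """Return (kind, start, end) tuples for polyA/polyT runs and a
    restriction site in a raw read. Hypothetical helper, 0-based,
    end-exclusive coordinates."""
    seq = seq.upper()
    features = []
    # Uninterrupted runs of at least min_run A's or T's
    # (no mismatches are tolerated in this simplified sketch).
    pattern = r"A{%d,}|T{%d,}" % (min_run, min_run)
    for match in re.finditer(pattern, seq):
        kind = "polyA" if match.group()[0] == "A" else "polyT"
        features.append((kind, match.start(), match.end()))
    # Exact occurrences of the recognition site (EcoRI, GAATTC, by default).
    start = seq.find(site)
    while start != -1:
        features.append(("site", start, start + len(site)))
        start = seq.find(site, start + 1)
    # Report features in order of their position along the read.
    return sorted(features, key=lambda f: f[1])

# Example: a site, a short insert, and a 15-base polyA tail.
read = "GAATTC" + "ACGTACGTAC" + "A" * 15 + "CTCGAG"
print(find_tail_features(read))
```

Keeping the coordinates of each feature, rather than trimming the read, is what preserves the positional information that downstream motif mining would depend on.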
