ParPEST: a pipeline for EST data analysis based on parallel computing

BackgroundExpressed Sequence Tags (ESTs) are short and error-prone DNA sequences generated from the 5' and 3' ends of randomly selected cDNA clones. They provide an important resource for comparative and functional genomic studies and, moreover, represent a reliable information for the annotation of genomic sequences. Because of the advances in biotechnologies, ESTs are daily determined in the form of large datasets. Therefore, suitable and efficient bioinformatic approaches are necessary to organize data related information content for further investigations.ResultsWe implemented ParPEST (Par allel P rocessing of EST s), a pipeline based on parallel computing for EST analysis. The results are organized in a suitable data warehouse to provide a starting point to mine expressed sequence datasets. The collected information is useful for investigations on data quality and on data information content, enriched also by a preliminary functional annotation.ConclusionThe pipeline presented here has been developed to perform an exhaustive and reliable analysis on EST data and to provide a curated set of information based on a relational database. Moreover, it is designed to reduce execution time of the specific steps required for a complete analysis using distributed processes and parallelized software. It is conceived to run on low requiring hardware components, to fulfill increasing demand, typical of the data used, and scalability at affordable costs.

[1]  Inge Jonassen,et al.  A graph based algorithm for generating EST consensus sequences , 2005, Bioinform..

[2]  Srinivas Aluru,et al.  Efficient clustering of large EST data sets on parallel computers. , 2003, Nucleic acids research.

[3]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[4]  Wei Huang,et al.  EST Pipeline System: Detailed and Automated EST Data Processing and Mining , 2003, Genomics, proteomics & bioinformatics.

[5]  Jennifer W. Weller,et al.  ESTAP-an automated system for the analysis of EST data , 2003, Bioinform..

[6]  Lei Liu,et al.  ESTIMA, a tool for EST management in a multi-project environment , 2004, BMC Bioinformatics.

[7]  John Quackenbush,et al.  TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets , 2003, Bioinform..

[8]  Winston A Hide,et al.  A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. , 1999, Genome research.

[9]  Lukas Wagner,et al.  A Greedy Algorithm for Aligning DNA Sequences , 2000, J. Comput. Biol..

[10]  J. Jurka Repbase update: a database and an electronic journal of repetitive elements. , 2000, Trends in genetics : TIG.

[11]  Peter Ernst,et al.  ESTAnnotator: a tool for high throughput EST annotation , 2003, Nucleic Acids Res..

[12]  Stephen Rudd,et al.  openSputnik—a database to ESTablish comparative plant genomics using unsaturated sequence collections , 2004, Nucleic Acids Res..

[13]  L. Wagner,et al.  21. UniGene: A Unified View of the Transcriptome , 2003 .

[14]  Hui-Hsien Chou,et al.  DNA sequence quality trimming and vector removal , 2001, Bioinform..

[15]  John Quackenbush,et al.  The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes , 2004, Nucleic Acids Res..

[16]  Susumu Goto,et al.  The KEGG resource for deciphering the genome , 2004, Nucleic Acids Res..

[17]  Mark L. Blaxter,et al.  Making sense of EST sequences by CLOBBing them , 2002, BMC Bioinformatics.

[18]  M. Boguski,et al.  dbEST — database for “expressed sequence tags” , 1993, Nature Genetics.

[19]  Amos Bairoch,et al.  The ENZYME database in 2000 , 2000, Nucleic Acids Res..

[20]  Cathy H. Wu,et al.  UniProt: the Universal Protein knowledgebase , 2004, Nucleic Acids Res..

[21]  Jennifer Daub,et al.  Expressed sequence tags: medium-throughput protocols. , 2004, Methods in molecular biology.

[22]  Robert Miller,et al.  STACK: Sequence Tag Alignment and Consensus Knowledgebase , 2001, Nucleic Acids Res..

[23]  L. Wagner,et al.  21. UniGene: A Unified View of the Transcriptome , 2003 .

[24]  Gregory D. Schuler,et al.  ESTablishing a human transcript map , 1995, Nature Genetics.

[25]  Gene Ontology Consortium The Gene Ontology (GO) database and informatics resource , 2003 .

[26]  X. Huang,et al.  CAP3: A DNA sequence assembly program. , 1999, Genome research.

[27]  D. Davison,et al.  d2_cluster: a validated method for clustering EST and full-length cDNAsequences. , 1999, Genome research.