EDITtoTrEMBL: A distributed approach to high-quality automated protein sequence annotation

SUMMARY Many databases in molecular biology face the problem that the ever-increasing rate of data production can no longer be handled by traditional methods, especially human curation. A number of projects are therefore investigating methods for automated sequence annotation. This paper describes the EBI's approach to this problem for protein sequences: the integration of arbitrary analysis programs into a distributed and highly flexible environment. Our software framework treats each sequence individually according to its particular properties, which is achieved through a high-level description of the preconditions and capabilities of the analysis modules. This improves the overall performance of the annotation process, because unnecessary steps are avoided, and it also enhances annotation quality, because dependencies between different modules are taken into account. We have implemented a prototype and use it in the production of TrEMBL releases.

AVAILABILITY Upon request.
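The idea of describing analysis modules by their preconditions and capabilities can be illustrated with a minimal sketch. It is not the EDITtoTrEMBL implementation: the `Module` class, the `annotate` function, the fact names, and the three toy modules are all hypothetical, and the sketch assumes that a precondition is simply a set of facts that must already be known about a sequence. A module runs only when its preconditions hold and it can still contribute something new, so unnecessary steps are skipped and modules that depend on another module's output run after it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical illustration only; none of these names come from EDITtoTrEMBL.
Facts = Dict[str, object]


@dataclass
class Module:
    name: str
    requires: Set[str]              # precondition: facts that must be known
    provides: Set[str]              # capability: facts this module can add
    run: Callable[[Facts], Facts]   # the analysis itself (a stub here)


def annotate(facts: Facts, modules: List[Module]) -> List[str]:
    """Run each applicable module once; return the names in execution order."""
    executed: List[str] = []
    pending = list(modules)
    progress = True
    while progress:
        progress = False
        for module in list(pending):
            if not module.requires.issubset(facts):
                continue                      # precondition not (yet) satisfied
            pending.remove(module)
            if module.provides.issubset(facts):
                continue                      # nothing new to learn: skip it
            facts.update(module.run(facts))
            executed.append(module.name)
            progress = True                   # new facts may enable other modules
    return executed


# Toy modules with invented behaviour, listed deliberately out of order.
MODULES = [
    Module("localization", {"sequence", "kingdom"}, {"location"},
           lambda f: {"location": "predicted for " + str(f["kingdom"])}),
    Module("genome_context", {"genome"}, {"operon"},
           lambda f: {"operon": "unknown"}),
    Module("taxonomy", {"organism"}, {"kingdom"},
           lambda f: {"kingdom": "eukaryota"}),
]

facts: Facts = {"sequence": "MKTAYIAKQR", "organism": "Homo sapiens"}
order = annotate(facts, MODULES)
print(order)
```

In this toy run, `localization` waits until `taxonomy` has supplied the `kingdom` fact, and `genome_context` never runs because no `genome` fact is available for the sequence.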
