Automated annotation of microbial proteomes in SWISS-PROT

Large-scale sequencing of prokaryotic genomes demands the automation of certain annotation tasks that are currently performed manually in the production of the SWISS-PROT protein knowledgebase. The HAMAP project, or 'High-quality Automated and Manual Annotation of microbial Proteomes', aims to integrate manual and automatic annotation methods in order to speed up the curation process while preserving the quality of the database annotation. Automatic annotation is applied only to entries that belong to manually defined orthologous families and to entries with no identifiable similarities (ORFans). Many checks are enforced to prevent the propagation of wrong annotation and to spot problematic cases, which are channelled to manual curation. The results of this annotation are integrated into SWISS-PROT, and a website is provided at http://www.expasy.org/sprot/hamap/.
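
The routing rule described above can be illustrated with a minimal sketch. It is not the HAMAP implementation: the `Entry` fields, the `route` function and the single `passes_checks` flag are hypothetical names introduced here, and sending every entry that is not eligible for automatic annotation to manual curation is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"


@dataclass
class Entry:
    # All fields are hypothetical; they stand in for whatever the real pipeline records.
    accession: str
    family: Optional[str] = None   # manually defined orthologous family, if any
    has_similarity: bool = True    # False for ORFans (no identifiable similarities)
    passes_checks: bool = True     # combined outcome of the consistency checks


def route(entry: Entry) -> Route:
    """Decide whether an entry is annotated automatically or sent to a curator."""
    # Eligible for automatic annotation: family members and ORFans only.
    eligible = entry.family is not None or not entry.has_similarity
    # Assumption: anything ineligible or failing a check goes to manual curation.
    if eligible and entry.passes_checks:
        return Route.AUTOMATIC
    return Route.MANUAL
```

Under these assumptions, a family member that fails a check is routed to manual curation, which mirrors the statement that problematic cases are channelled to curators.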
