Protein Sequence Annotation in the Genome Era: The Annotation Concept of SWISS-PROT + TREMBL

SWISS-PROT is a curated protein sequence database that strives to provide a high level of annotation, a minimal level of redundancy, and a high level of integration with other databases. Ongoing genome sequencing projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of incoming data indefinitely. However, as we also want to make the sequences available as quickly as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT. TREMBL consists of computer-annotated entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT. While TREMBL is already of immense value, its computer-generated annotation does not match the quality of the annotation in SWISS-PROT. The main difference lies in the protein functional information attached to sequences. With this in mind, we are dedicating substantial effort to developing and applying computer methods that enhance the functional information attached to TREMBL entries.
